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Abstract 


Diagonally  scaling  a  matrix  often  reduces  its  condition  number.  Equilibration  scales  a  matrix  so 
that  the  row  and  column  norms  are  equal.  We  review  the  existence  and  uniqueness  theory  for 
exact  equilibration.  Then  we  introduce  a  formalization  of  approximate  equilibration  and  develop  its 
existence  and  uniqueness  theory.  Next  we  develop  approximate  equilibration  algorithms  that  access 
a  matrix  only  by  matrix-vector  products.  We  address  both  the  nonsymmetric  and  symmetric  cases. 

Limited-memory  quasi-Newton  methods  may  be  thought  of  as  changing  the  metric  so  that  the 
steepest-descent  method  works  effectively  on  the  problem.  Quasi-Newton  methods  construct  a  ma¬ 
trix  using  vectors  of  two  types  involving  the  iterates  and  gradients.  The  vectors  are  related  by 
an  approximate  matrix- vector  product.  Using  our  approximate  matrix- free  symmetric  equilibration 
method,  we  develop  a  limited-memory  quasi-Newton  method  in  which  one  part  of  the  quasi-Newton 
matrix  approximately  equilibrates  the  Hessian. 

Often  a  differential  equation  is  solved  by  discretizing  it  on  a  sequence  of  increasingly  fine  meshes. 
This  technique  can  be  used  when  solving  differential-equation-constrained  optimization  problems. 
We  describe  a  method  to  interpolate  our  limited-memory  quasi-Newton  matrix  from  a  coarse  to  a 
fine  mesh. 
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Chapter  1 

Introduction 


Quasi-Newton  methods 

Consider  the  unconstrained  quadratic  program  min^  f(x)  for  f(x)  =  ^ xT Ax  +  bTx.  The  gradient 
is  g(x)  =  V f(x)  =  Ax  +  b.  In  a  quasi-Newton  (QN)  method,  an  approximation  to  the  Hessian  is 
constructed  by  assembling  pairs  {s,  y)  into  a  matrix,  s  is  the  difference  between  iterates:  s  =  x+  —  x; 
y  is  the  difference  between  the  gradients  at  the  iterates:  y  =  g+  —  g.  x+  occurs  at  the  iteration 
subsequent  to  x,  and  similarly  for  other  quantities  using  the  superscript—)-  notation.  Observe  that 
As  =  y  and  so  s  and  y  are  related  by  a  matrix- vector  product  involving  the  Hessian.  For  a  general 
optimization  problem,  the  two  are  approximately  related  by  a  matrix- vector  product. 

On  an  unconstrained  convex  quadratic  program,  many  quasi-Newton  methods  obtain  the  Hessian 
after  a  finite  number  of  iterations.  In  real  applications — for  example,  on  nonconvex,  constrained 
problems;  and  on  very  large  problems  for  which  the  storage  for  a  quasi-Newton  matrix  is  limited — 
the  Hessian  is  not  recovered.  A  QN  method  may  be  thought  of  as  changing  the  metric  of  the 
problem  while  the  problem  is  being  solved.  Search  directions  are  then  steepest-descent  directions 
for  the  modified  problem  corresponding  to  the  current  QN  matrix. 

A  matrix  Bq  initializes  the  QN  matrix.  Most  often,  Bq  is  chosen  to  be  a  diagonal  matrix.  An 
algorithm  can  change  B0  as  often  as  at  each  update  of  the  QN  matrix.  Section  3.1.2  shows  that  in 
Broyden-class  QN  methods,  a  diagonal  Bq  may  be  interpreted  as  scaling  the  problem.  Therefore, 
an  effective  choice  for  Bq  is  a  scaling  matrix  that  minimizes  the  condition  number  of  the  Hessian. 

The  time  a  solver  requires  to  find  a  solution  is  a  key  performance  criterion.  Time  may  be  divided 
into  two  parts:  time  spent  in  the  solver,  chiefly  on  linear  algebra  operations;  and  time  spent  in  the 
user  function ,  which  is  the  software  client’s  implementation  of  the  problem.  For  many  problems, 
the  time  required  by  the  solver  is  considerably  less  than  the  time  required  by  the  user  function. 
Hence  an  optimization  algorithm  is  often  developed  with  the  objective  of  minimizing  the  number  of 
user  function  calls.  Improving  the  particular  QN  method  an  optimization  software  package  uses  can 
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CHAPTER  1.  INTRODUCTION 


reduce  the  number  of  user  function  calls  required  to  solve  a  problem. 

Matrix  equilibration 

The  performance  of  many  algorithms,  particularly  iterative  ones,  depends  on  the  condition  number 
of  an  underlying  matrix.  For  example,  the  convergence  rate  of  the  steepest-descent  method  and 
Krylov-subspace  methods  depends  on  the  condition  number  relevant  to  the  problem.  Hence  a  good 
scaling  method  substantially  reduces  the  condition  number  of  the  matrix. 

If  a  matrix  is  symmetric  positive  definite,  scaling  it  to  unit  diagonal  is  effective  and  always 
possible.  If  a  matrix  is  indefinite,  such  scaling — substituting,  say,  1  for  a  0  diagonal  element — is  less 
often  effective.  Far  more  effective  in  the  latter  case  is  scaling  the  matrix  so  that  its  row  and  column 
norms  are  all  equal.  This  scaling  procedure  is  called  equilibration.  If  a  matrix  is  symmetric,  one 
generally  prefers  symmetric  scaling.  The  theory  for  symmetric  scaling  is  often  more  intricate  than 
for  nonsymmetric  scaling;  additionally,  algorithms  must  be  specialized  to  this  case. 

For  a  particular  scaling  method,  we  ask  three  questions:  1.  What  matrices  can  be  scaled  by 
the  method?  2.  Is  the  resulting  scaled  matrix  unique?  3.  Are  the  scaling  matrices  unique?  The 
first  question  is  self-evidently  important:  a  problem  solver  wants  to  know  in  advance  whether  a 
particular  matrix  of  interest  is  scalable  by  a  particular  method.  A  positive  answer  to  the  second 
question  assures  that,  relative  to  the  particular  scaling  method,  an  algorithm  has  done  as  well  as  it 
can:  there  is  not  another  scaled  matrix  that  is  better-  say,  has  a  lower  condition  number.  The  third 
question  may  matter  if  one  wants  the  scaling  matrices  to  have  particular  properties:  for  example, 
to  have  a  bounded  condition  number.  Nonuniqueness  provides  freedom  to  vary  the  scaling  matrices 
according  to  additional  criteria. 

It  is  unnecessary  to  scale  a  matrix  exactly;  approximate  scaling  may  achieve  a  sufficient  reduction 
in  condition  number.  The  class  of  approximately  scalable  matrices  may  be  larger  than  the  class  of 
exactly  scalable  matrices. 

Matrix-free  methods 

An  algorithm  is  said  to  be  a  matrix-free  method  if  it  obtains  information  about  a  matrix  only  through 
matrix-vector  products.  Matrix-free  methods  are  useful  when  a  matrix  is  represented  implicitly:  for 
example,  by  a  factorization  or  by  an  algorithm.  If  a  matrix  is  accessible  only  by  matrix-vector 
products,  we  think  of  the  matrix  as  an  operator. 

Quasi-Newton  methods  for  differential-equation-constrained  optimization  problems 

Simulating  a  physical  phenomenon  often  involves  solving  a  system  of  differential  equations.  The 
engineer  or  scientist  may  want  to  do  more  than  simply  simulate  a  system.  An  increasingly  common 
computational  problem  is  to  optimize  parameters  governing  the  physical  system.  The  resulting 
problem  is  a  differential-equation-constrained  optimization  problem  (DECP). 
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A  common  technique  for  solving  a  differential  equation  is  to  solve  the  discretized  problem  on  a 
sequence  of  increasingly  fine  meshes.  Often  the  fine  mesh  is  adapted  to  the  solution  on  the  coarse 
mesh.  The  same  technique  is  useful  when  solving  a  DECP.  If  a  quasi-Newton  method  is  used,  the 
data  by  which  the  QN  matrix  is  formed  on  the  coarse  mesh  may  be  useful  to  initialize  the  QN  matrix 
on  the  fine  mesh. 

Our  work 

In  Chapter  2,  we  review  the  theory  governing  the  existence  of  exactly  scalable  matrices  and  the 
uniqueness  of  the  scaling  matrices  and  the  scaled  matrix.  Then  we  introduce  e-scaling,  a  formaliza¬ 
tion  of  approximate  equilibration,  and  develop  the  analogous  existence  and  uniqueness  theory.  We 
discuss  both  the  nonsymmetric  and  symmetric  cases. 

Next  we  develop  matrix-free  methods  to  approximately  scale  symmetric,  square  nonsymmetric, 
and  rectangular  matrices.  We  test  our  algorithms  on  a  large,  well-known  test  set  of  matrices. 
Our  performance  criteria  are  change  in  condition  number  and  a  metric  quantifying  the  degree  of 
equilibration. 

In  Chapter  3,  we  use  our  ideas  on  matrix-free  approximate  equilibration  to  develop  a  limited- 
memory  quasi-Newton  method.  The  algorithm  updates  Bq  to  scale  the  problem.  It  uses  the  pairs 
{s,y}  just  as  any  QN  method  does,  and  so  no  additional  user  function  calls  are  necessary.  That  s 
and  y  are  related  by  an  approximate  matrix-vector  product  provides  the  crucial  connection  to  our 
matrix-free  equilibration  methods.  Since  our  algorithm  updates  B0,  it  complements  standard  QN 
methods  that  build  on  the  initial  matrix  Bq.  We  combine  our  update  with  a  well-known  limited- 
memory  method. 

After  designing  and  analyzing  the  method,  we  describe  its  implementation  in  SNOPT  [25],  a 
software  package  that  is  designed  to  solve  large,  generally  constrained  optimization  problems.  Then 
we  test  the  modified  SNOPT  on  six  test  sets.  While  computation  time  is  ultimately  the  only 
important  criterion,  it  can  be  quite  deceptive  when  testing  an  algorithm  on  test  sets  of  problems. 
The  test  set  may  contain  a  mix  of  problems  having  different  sizes  and  characteristics,  for  example. 
Thus,  while  we  report  time  for  test  sets  for  which  time  makes  sense,  our  primary  assessment  criterion 
is  the  number  of  user  function  calls. 

In  Chapter  4,  we  develop  a  method  to  interpolate  a  limited-memory  quasi-Newton  matrix  from 
a  coarse  mesh  to  a  fine  one.  We  report  results  for  two  classes  of  problems. 


Chapter  2 


Equilibration  of  matrices 


Preconditioning  a  matrix  is  an  important  first  step  for  solving  a  linear  equation;  iterative  methods 
particularly  benefit.  Broadly,  preconditioning  is  intended  to  cluster  eigenvalues  or  to  reduce  the 
condition  number  of  a  matrix.  Problem-dependent  preconditioners  often  achieve  both  goals.  In 
the  absence  of  any  other  preconditioner,  a  diagonal  scaling  matrix  is  useful  to  reduce  the  condition 
number  of  a  matrix.  The  problem  of  finding  a  diagonal  scaling  matrix  has  the  equivalent  names 
equilibration ,  balancing,  and  binormalization.  Equilibration  algorithms  exist  to  achieve  a  number  of 
different  objectives:  for  example,  unit  diagonal  elements,  equal  row  or  column  p-norms,  bounded 
elements.  The  objectives  may  be  achieved  either  exactly  or  approximately. 

For  a  particular  type  of  scaling,  several  questions  arise:  What  is  the  class  of  matrices  that  can 
be  scaled?  Is  the  scaling  unique?  How  much  is  the  condition  number  of  the  matrix  decreased?  Must 
the  scaling  be  exact,  or  can  it  be  approximate?  For  a  particular  problem,  additional  questions  arise. 
What  algorithms  can  implement  the  scaling?  In  particular,  does  one  have  access  to  the  elements  of 
the  matrix,  or  must  one  use  matrix-vector  products  only? 

In  this  chapter,  we  consider  the  problem  of  scaling  a  square  matrix  A  by  positive  diagonal  matrices 
so  that  the  p-norm  of  each  row  and  column  of  the  resulting  matrix  is  approximately  the  same.  Our 
focus  is  on  the  symmetric  problem,  in  which  a  symmetric  matrix  is  symmetrically  scaled.  First, 
we  review  existence  and  uniqueness  theorems  for  exact  scaling  of  A.  Second,  we  develop  and  prove 
a  necessary  and  sufficient  condition  for  approximately  scaling  positive  diagonal  matrices  to  exist. 
Third,  we  develop  two  new  approximate  scaling  algorithms  that  balance  the  row  and  column  2-norms 
and  that  access  A  only  through  matrix-vector  products.  Finally,  we  describe  the  performance  of  the 
algorithms  on  a  large  test  set. 
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2.1  Background 

2.1.1  Notation 

The  vectors  x  and  d  are  special:  Xi  =  d^1.  Let  V  =  diag(u),  where  v  is  a  vector.  For  a  matrix 
A,  let  B  be  such  that  Btj  =  |Ajjjp.  If  p  =  2,  B  =  A  o  A,  i.e.,  B  is  the  element-wise  (Hadamard) 
product  of  A.  If  A  is  complex,  the  o  operator  conjugates  the  first  argument,  e  is  the  vector  of  all 
ones.  The  square  of  a  vector  is  applied  by  element:  (v2)i  =  v2.  Similarly,  a  vector  product  written 
as  xy  is  performed  by  clement:  ( xy)i  =  Xiyi.  An  inequality  such  as  v  >  0,  where  v  is  a  matrix  or 
vector,  applies  element-wise.  Index  sets  are  denoted  by  calligraphic  letters,  and  A f  is  reserved  for 
the  set  {1,  2, . . .  ,n}.  The  cardinality  of  the  set  is  denoted,  for  example,  \AT\  =  n.  Often  a  sum  over 
an  index  set  is  abbreviated  when  the  index  set  is  clear;  for  example,  may  be  written  simply 

as  E  i- 

2.1.2  The  scaling  equation 

Suppose  we  want  the  p-norm  of  each  row  of  the  symmetric  matrix  A  to  be  c  >  0.  We  must  find  a 
vector  x  >  0  such  that 

xi  'y  '  \Aij\Pxj  =  c > 
j 

or,  more  compactly, 

XBx  =  ce.  (2.1) 

If  x*  satisfies  (2.1)  for  c  =  C\  >  0,  then  ^c2/ciX*  is  a  solution  for  c  =  c2  >  0.  Consequently,  we 
need  not  consider  the  particular  value  of  c  >  0  in  what  follows  and  so  we  set  c  =  1. 

We  refer  to  (2.1)  as  the  scaling  equation.  Although  algorithms  must  use  A,  existence  and  unique¬ 
ness  theory  need  consider  only  the  nonnegative  matrix  B.  If  p  =  1  and  A  is  nonnegative,  then 
A  =  B.  We  reserve  the  term  binormalization  for  the  case  p  =  2.  We  say  A  is  scalable  if  there  exists 
x  >  0  satisfying  (2.1).  We  call  such  a  vector  a  scaling  vector  or,  on  occasion  in  the  case  of  p  =  2, 
a  binormalization  vector.  If  two  matrices  B \  and  B2  are  related  to  each  other  by  B2  =  XB\X ,  we 
say  the  two  are  diagonally  equivalent. 

Observe  that  c~1XBX  is  doubly  stochastic. 

Not  all  matrices  are  scalable.  Furthermore,  the  benefits  of  scaling  may  be  obtained  by  approxi¬ 
mate  rather  than  exact  scaling.  Livne  and  Golub  [42]  introduced  a  measure  of  approximate  scaling: 
A  is  scaled  to  a  residual  whose  norm  is  e  if  || XBx  —  e||  =  e.  We  introduce  the  following  definition: 
A  is  e-scalable  if  for  any  e  >  0,  there  exists  x  >  0  satisfying  || XBx  —  e\\  <  e.  Of  course  a  scalable 
matrix  is  £-scalable. 
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Nonsymmetric  scaling  is  similar.  In  this  case,  there  exist  x,y  >  0  such  that 
XBy  =  e  and  YBTx  =  e. 

If  A  is  rectangular,  the  row  and  column  norms  may  differ  in  magnitude.  We  address  the  rectangular 
case  in  numerical  experiments  only;  our  theoretical  analysis  assumes  A  is  square.  A  is  nonsymmet- 
rically  e-scalable  if  for  every  e  >  0,  each  equation  can  be  satisfied  to  a  residual  having  norm  no 
greater  than  e. 

Numerical  experiments  (see,  e.g .,  [42]  and  Section  2.4)  for  the  case  p  =  2  show  that  the  condition 
number  of  the  scaled  matrix  A  is  often  considerably  less  than  that  of  A.  For  indefinite  A,  the  far 
simpler  procedure  of  scaling  to  unit  diagonal  elements  is  undefined  if  a  diagonal  element  is  zero 
and  problematic  if  a  diagonal  element  is  small.  Furthermore,  the  reduction  in  condition  number  is 
frequently  substantially  less  than  that  achieved  by  equilibration. 

2.1.3  Context  and  main  results 

Equilibration  of  matrices  has  a  long  history.  Research  on  this  problem  divides  into  three  parts: 
analyzing  the  improvement  diagonal  scaling  offers,  either  in  terms  of  reducing  the  condition  number 
or  a  related  quantity  of  the  matrix  or  in  terms  of  the  improvement  in  sparsity  or  quality  of  a 
factorization;  finding  necessary  or  sufficient  conditions  for  the  existence  or  uniqueness  of  diagonal 
scaling  matrices;  and  developing  and  analyzing  algorithms.  Several  equilibration  problems  exist, 
although  all  are  generally  formulated  in  terms  of  nonnegative  matrices;  most  notable  are  variations 
in  which  row  or  column  norms  are  not  equal  but  rather  are  specified  by  positive  vectors. 

Existence  and  uniqueness 

Sinkhorn  wrote  a  series  of  papers  in  the  1960s  that  detailed  some  of  the  existence  and  uniqueness 
theory  governing  matrix  equilibration.  Sinkhorn  and  Knopp’s  paper  [62]  is  quite  frequently  cited 
and  contains  the  well-known  iteration  for  matrix  balancing  that  bears  their  names.  Sinkhorn  and 
Knopp  proved  the  major  convergence  result  for  the  algorithm  in  [62]  in  the  course  of  proving  the 
condition  for  existence  of  scaling  matrices  for  the  nonsymmetric  problem;  independently,  Brualdi, 
Parter,  and  Schneider  proved  overlapping  results  in  [9].  Before  we  can  discuss  their  theorems,  we 
must  introduce  some  definitions  and  preliminary  results. 

A  square  matrix  has  total  support  if  every  nonzero  element  occurs  in  the  positive  main  diagonal 
of  a  matrix  that  is  a  column  permutation  of  the  original  matrix.  Formally,  let  A  be  a  square  matrix 
having  total  support,  and  suppose  is  nonzero.  Then  there  exists  a  permutation  a  such  that 
a(i)  =  j  and  each  Aka^  is  nonzero.  We  say  that  <r,  or  alternatively  the  positive  diagonal  associated 
with  <7,  supports  the  nonzero  element  Ajj . 

A  matrix  has  support  that  is  not  total ,  or  simply  has  support ,  if  a  positive  main  diagonal  exists 
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under  a  column  permutation.  By  this  definition  it  is  evident  that  a  square  matrix  has  support  if 
and  only  if  it  is  structurally  nonsingular  [18]. 

Mirsky  and  Perfect  [44]  showed  that  a  matrix  has  total  support  if  and  only  if  there  exists  a 
doubly  stochastic  matrix  having  the  same  zero  pattern.  Other  authors  use  only  the  “if”  part,  which 
follows  directly  from  Birkhoff’s  Theorem:  a  doubly  stochastic  matrix  is  a  convex  combination  of 
permutation  matrices  (see,  e.g.,  Theorem  8.7.1  of  [28]). 

A  matrix  A  is  partly  decomposable  if  there  exist  permutation  matrices  P  and  Q  such  that 


PAQ  = 


'  B  > 
Dj 


where  B  and  D  are  square  matrices.  A  square  matrix  is  fully  indecomposable  if  it  is  not  partly 
decomposable.  A  fully  indecomposable  matrix  has  total  support  [8]. 

Sinkhorn  and  Knopp  proved  the  following  in  [62]. 

Theorem  1  (Sinkhorn  and  Knopp).  Let  A  be  a  nonnegative  square  matrix. 

1.  A  is  scalable  if  and  only  if  A  has  total  support.  That  is,  there  exist  positive  diagonal  matrices 
D\  and  D2  such  that  B  =  D1AD2  is  doubly  stochastic. 

2.  If  A  is  scalable,  then  B  is  unique. 

3.  D\  and  D2  are  unique  up  to  a  scalar  multiple  if  and  only  if  A  is  fully  indecomposable. 

4.  The  Sinkhorn-Knopp  iteration  converges  to  a  doubly  stochastic  matrix  if  and  only  if  A  has 
support.  If  A  has  support  that  is  not  total,  then  D\  and  D2  have  elements  that  diverge. 

Hence  scalability  of  a  matrix  is  entirely  determined  by  its  zero  pattern.  Parts  1-3  were  independently 
discovered  in  [9]:  the  authors  of  each  paper  acknowledge  the  other  accordingly.  The  necessity  of 
total  support  follows  directly  from  Birkhoff’s  theorem. 

In  addition  to  working  on  the  nonsymmetric  problem,  Brualdi,  Parter,  and  Schneider  [9]  provided 
an  early  existence  result  for  scaling  symmetric  matrices.  Their  Corollary  7.7  is  equivalent  to  the 
existence  part  of  our  Theorem  4.  Marshall  and  Olkin  [43]  obtained  overlapping  results,  and  our 
proof  technique  for  Theorem  4  is  essentially  theirs.  Brualdi  and  his  coauthors  also  developed  several 
technical  results.  Corollary  7.8  concerns  the  structure  of  a  matrix:  Let  A  be  nonnegative  and 
symmetric.  Then  every  other  (not  necessarily  symmetric)  matrix  B  having  the  same  zero  pattern 
as  B  can  be  scaled  by  a  positive  diagonal  D  so  that  DBD  is  row  stochastic  if  and  only  if  A  has  a 
positive  diagonal.  Interestingly,  their  Lemma  7.3  reveals  the  same  block  structure  (2.9)  that  we  find 
in  our  Theorem  3,  although  the  mathematical  setting  is  different.  Theorem  8.2  of  [9]  states  a  rather 
technical  sufficient  condition  for  uniqueness  more  general  than  our  Theorem  4,  and  several  other 
more  general  sufficient  conditions  have  appeared,  including  in  [43]. 
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Brualdi,  Parter,  and  Schneider  proved  a  number  of  results  for  both  the  nonsymmetric  and  sym¬ 
metric  problems.  However,  according  to  Brualdi  [7],  Csima  and  Datta  [16]  obtained  the  definitive 
existence  theorem  for  symmetric  scaling  three  years  after  the  publication  of  [62]  and  [9]. 

Theorem  2  (Csima  and  Datta).  A  symmetric  matrix  is  scalable  if  and  only  if  it  has  total  support. 

Interestingly,  the  necessary  and  sufficient  condition  of  total  support  in  this  theorem  is  identical  to 
that  in  part  1  of  Theorem  1.  The  necessary  part  follows  directly  from  part  1  of  Theorem  1,  but  the 
sufficiency  part  requires  several  steps.  The  key  ideas  are  as  follows.  Total  support  of  a  symmetric 
matrix  A  implies  A  is  composed  (in  a  precisely  defined  sense)  of  fully  indecomposable  matrices 
having  total  support.  Each  such  matrix  can  be  scaled;  by  part  3  of  Theorem  1,  the  scaling  is  unique; 
and  so  the  scaling  is  symmetric  (because  A  =  AT  and  if  D1AD2  and  D2AtDi  are  doubly  stochastic 
and  A  is  fully  indecomposable,  Di  and  D2  are  unique  up  to  a  scalar  multiple).  The  scaling  matrices 
for  the  submatrices  of  A  can  be  assembled  to  form  a  scaling  matrix  for  A.  Total  support  is  essential 
to  this  proof,  and  so  the  matter  remains  of  support  that  is  not  total. 

It  appears  that  subsequent  work  has  not  resolved  the  question.  We  discovered  the  following 
condition  and  prove  it  in  Section  2.2.2. 

Theorem  3.  A  symmetric  matrix  is  e-scalable  if  and  only  if  it  has  support. 

An  analog  of  the  “if”  part  of  Theorem  3  for  the  nonsymmetric  problem  is  directly  implied  by  part  4 
of  Theorem  1:  one  can  terminate  the  Sinkhorn-Knopp  iteration  when  the  current  iterate  produces  a 
matrix  that  is  approximately  doubly  stochastic  to  a  given  tolerance.  Just  as  Theorem  2  reveals  the 
same  condition  as  part  1  of  Theorem  1,  Theorem  3  reveals  the  same  condition  as  part  4  of  Theorem 
1.  Yet  in  both  cases,  the  proofs  are  quite  different  for  the  nonsymmetric  and  symmetric  problems. 

The  question  of  uniqueness  remains.  Part  2  of  Theorem  1  states  that  if  A  has  total  support, 
although  Di  and  D2  may  not  be  unique,  the  doubly  stochastic  matrix  DiAD2  is.  Part  4  states  that 
if  A  has  support  that  is  not  total,  the  Sinkhorn-Knopp  iteration  converges  to  a  doubly  stochastic 
matrix  C.  Parlett  and  Landis  [57]  obtained  the  same  result  for  a  class  of  iterations  obeying  certain 
conditions,  of  which  the  Sinkhorn-Knopp  iteration  is  a  member.  Both  sets  of  authors  characterized 
the  matrix  C  as  follows.  To  each  nonzero  element  in  Aij  that  lacks  support  there  corresponds  a  zero 
element  C)j ;  to  each  nonzero  element  A,  j  that  has  support  there  corresponds  a  nonzero  element  C,l7 . 
This  is  not  surprising:  the  equivalence  of  having  total  support  and  having  the  same  zero  pattern  as 
a  doubly  stochastic  matrix  (due  to  Perfect  and  Mirsky),  combined  with  part  1  of  Theorem  1,  show 
that  a  matrix  lacking  total  support  does  not  have  the  zero  pattern  of  a  doubly  stochastic  matrix. 
However,  neither  Sinkhorn  and  Knopp  nor  Parlett  and  Landis  answered  the  following  question:  In 
what  sense,  if  any,  is  C  unique  if  A  has  support  that  is  not  total?  We  provide  an  answer  to  this 
question  in  terms  of  e-scalability  in  Theorem  6.  By  developing  a  uniqueness  theorem  in  terms  of 
e-scalability,  we  sidestep  the  issue  of  particular  algorithms:  any  algorithm  that  e-scales  a  matrix 
yields  a  matrix  C  that  is  unique  in  the  sense  of  Theorem  6. 
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Quite  a  number  of  papers  have  reported  new  methods  to  prove  classic  results.  For  example, 
Borobia  and  Canto  [5]  proved  Sinkhorn’s  early  result  on  scaling  positive  matrices  [61] — parts  1-3 
of  Theorem  1  applied  to  strictly  positive  matrices — using  geometric  methods.  O’Leary  [52]  proved 
Marshall  and  Olkin’s  theorem  from  [43] — an  spd  matrix  can  be  symmetrically  scaled  to  achieve  a 
given  positive  row  sum  vector  -using  a  new  constructive  method,  and  slightly  extended  the  result. 

Johnson  and  Reams  [30]  recently  described  a  number  of  new  conditions  characterizing  the  exis¬ 
tence  of  a  scaling  matrix  for  the  symmetric  problem  when  the  matrix  is  general  rather  than  nonneg¬ 
ative.  A  symmetric  matrix  is  copositive  if  xT Ax  >  0  for  all  x  >  0  and  strictly  copositive  if  xT Ax  >  0 
for  x  >  0  and  x  ^  0.  Marshall  and  Olkin  [43]  showed  that  A  is  scalable  if  it  is  strictly  copositive. 
The  rather  technical  Theorem  4  of  [30]  generalizes  this  condition  by  considering  a  certain  cone. 

Neumaier  and  Olschowka  [49,  54]  developed  a  scaling  result  related  to  structural  rank  for  a 
different  kind  of  scaling  than  we  consider.  They  showed  that  a  (symmetric)  matrix  that  has  support 
can  be  scaled  so  that  every  entry  of  the  resulting  (symmetric)  matrix  is  bounded  in  magnitude  by 
1  and  the  diagonal  elements  are  all  1. 

Algorithms 

Many  equilibration  algorithms  have  been  proposed.  The  classical  iteration  is  that  of  Sinkhorn  and 
Knopp,  who  first  analyzed  its  convergence  properties.  According  to  Parlett  and  Landis  [57],  the 
iteration  itself  was  used  as  early  as  1940,  and  according  to  Knight  [35],  as  early  as  the  1930s  in  an 
application  to  traffic  flow.  Parlett  and  Landis  [57]  developed  several  iterations  that  outperformed  the 
Sinkhorn-Knopp  iteration  on  a  test  set.  According  to  Knight  and  Ruiz  [36],  Khachiyan  and  Kalantari 
[33]  were  the  first  to  propose  using  Newton’s  method  to  solve  a  particular  system  of  equilibration 
equations.  Livne  and  Golub  [42]  developed  an  algorithm  based  on  the  Gauss-Seidel-Newton  method 
to  solve  the  nonlinear  binormalization  equations.  They  proved  that  the  algorithm  converges  locally. 
In  practice,  only  a  few  iterations  of  the  algorithm  are  performed  to  obtain  a  vector  that  suitably 
approximates  a  binormalization  vector.  Knight  and  Ruiz  [36]  compared  their  algorithm,  an  inexact 
Newton  method  using  the  conjugate  gradients  iteration,  favorably  with  the  classical  Sinkhorn-Knopp 
iteration  and  a  variation  of  Livne  and  Golub’s  method. 

Equilibration  in  the  infinity  norm  is  not  unique  and  so  motivates  multiple  algorithms  that  consider 
both  efficiency  and  quality  of  the  scaling  under  criteria  other  than  the  infinity  norms  of  the  rows  and 
columns.  A  matrix  can  be  scaled  in  the  infinity  norm  if  it  has  no  zero  rows  or  columns.  The  simplest 
algorithm  is  as  follows:  First  scale  the  rows  (or  columns),  then  scale  the  columns  (or  rows).  After  the 
first  scaling,  the  largest  number  in  the  matrix  is  1,  and  the  second  scaling  cannot  produce  numbers 
that  are  larger  than  1.  Therefore,  scaling  is  achieved  after  one  iteration.  Bunch  [11]  developed  an 
algorithm  that  equilibrates  any  symmetric  matrix  in  the  infinity  norm.  More  recently,  Ruiz  [60] 
developed  another  iteration  that  under  his  criteria  compares  favorably  with  Bunch’s  algorithm. 

Convergence  results  for  exact  scalability  in  which  total  support  is  an  assumption  immediately 
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yield  corresponding  convergence  results  for  e-scalability.  The  proof  of  Theorem  3  shows  that  if  A  is 
e-scalable,  an  e-scaling  vector  can  be  obtained  by  exactly  scaling  B  +  61  for  sufficiently  small  <5.  By 
Lemma  6,  a  symmetric  matrix  whose  main  diagonal  is  positive  has  total  support.  Therefore,  if  an 
algorithm  converges  for  matrices  having  total  support,  then  it  converges  for  B  +  51  and  so  produces 
an  e-scaling  matrix  for  B. 

Existence  and  uniqueness  theory  focuses  on  nonnegative  matrices  because  equilibration  in  a  p- 
norm  induces  an  equation  on  a  nonnegative  matrix  even  if  the  original  matrix  is  general.  But 
equilibration  algorithms  must  take  into  account  sign  information.  Not  all  authors  are  interested 
in  scaling  general  symmetric  matrices,  and  restricting  the  class  of  matrices  on  which  an  algorithm 
operates  can  be  advantageous.  Most  often  authors  restrict  their  attention  to  only  nonnegative  or 
positive  definite  matrices.  For  example,  Khachiyan  and  Kalantari  [33]  focus  on  spd  matrices. 

Matrix-free  algorithms 

To  date  it  appears  that  all  scaling  algorithms  for  general  symmetric  matrices  require  access  to 
the  elements  of  the  matrix.  If  A  is  nonnegative,  the  situation  is  much  different;  for  example, 
the  Sinkhorn-Knopp  algorithm  requires  only  the  matrix-vector  product  Ax.  For  general  matrices, 
algorithms  need  at  least  matrix-vector  products  of  the  form  \A\x  ( p  =  1),  (AoA)x  ( p  =  2),  or  similar 
expressions.  We  introduce  an  approximate  scaling  algorithm  for  the  case  p  =  2  that  requires  only 
the  matrix-vector  product  Ax. 

Matrix-free  and  stochastic  matrix-free  methods  have  a  vast  literature.  Of  course  the  most  com¬ 
mon  matrix-free  methods  are  Krylov-subspace  iterations.  Stochastic  matrix-free  methods  in  linear 
algebra  may  have  a  more  recent  origin,  but  stochastic  methods  for  general  problems  go  back  at  least 
as  far  as  the  first  work  of  Robbins  and  Monro  [59]  on  what  is  now  called  stochastic  approximation. 
Bekas,  Kokiopoulou,  and  Saad  [1]  developed  a  matrix-free  method  to  estimate  the  diagonal  elements 
of  a  matrix.  Chen  and  Demmel  [14]  developed  a  method  to  balance  a  matrix  prior  to  eigenvalue 
computations.  (Balancing  for  eigenvalue  problems  is  a  similarity  transform  and  so  is  not  the  same 
as  the  balancing  we  discuss.) 

2.1.4  Our  contributions 

Formalizing  approximate  scaling  by  e-scaling  is  useful.  Livne  and  Golub  [42]  seem  to  be  the  first 
to  have  recognized  the  value  of  defining  a  measure  of  approximate  scaling.  But,  first,  they  did  not 
make  concrete  their  definition;  and,  second,  they  left  open  existence  and  uniqueness  questions.  We 
realized  that  developing  a  definition  of  approximate  scaling  in  terms  of  a  limit  can  be  useful,  and 
we  answer  existence  and  uniqueness  questions  in  terms  of  this  definition. 

Without  the  framework  of  £-scaling,  it  is  hard  to  formulate  concrete  statements  about  scaling  a 
matrix  that  lacks  support.  For  example,  Parlett  and  Landis  [57]  wrote:  “Note  that  matrices  without 
support  are  not  covered  by  [Sinkhorn  and  Knopp’s  theorem] .  Such  matrices  are  always  singular,  and 
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V.  Kahan  (private  communication)  has  shown  that  the  sequence  of  iteration  matrices  . . .  produced  by 
[the  Sinkhorn-Knopp]  iteration  cycles  for  such  a  starting  matrix.”  This  observation  has  in  common 
with  others  that  statements  about  matrices  lacking  support  are  made  with  reference  to  the  Sinkhorn- 
Knopp  iteration  or  algorithms  that  share  certain  properties  with  it.  As  another  example,  part  4  of 
Theorem  1  concerns  explicitly  the  Sinkhorn-Knopp  iteration.  In  contrast,  because  Lemma  3  makes 
a  statement  about  e-scaling,  it  assures  that  no  algorithm — not  just  certain  ones — can  approximately 
scale  a  matrix  lacking  support. 

Some  statements  about  e-scaling  follow  directly  from  known  results.  Theorem  5  identifies  support 
as  a  sufficient  condition  for  nonsymmetric  e-scaling,  and  the  proof  follows  immediately  from  part 
4  of  Theorem  1.  The  proof  of  the  necessary  condition  Lemma  3  is  fairly  straightforward  and  so 
probably  has  been  used  before  in  a  different  context.  Still,  it  may  be  new  within  the  context  of 
scaling.  For  although  the  proof  of  Lemma  3  requires  only  a  few  minor  changes  to  show  that  support 
is  a  necessary  condition  for  exact  scaling,  previous  authors  have  used  Birkhoff’s  theorem  to  prove 
this  necessary  condition  in  the  exact  case.  Birkhoff’s  theorem  seems  not  to  be  applicable  in  the  case 
of  e-  rather  than  exact  scaling. 

We  believe  Theorems  3  and  6  are  new  and  are  not  immediate  extensions  of  known  results  to  the 
framework  of  e-scaling.  Theorem  3  establishes  a  necessary  and  sufficient  condition  for  a  symmetric 
matrix  to  be  e-scalable.  Just  as  part  1  of  Theorem  1  and  Theorem  2  have  the  same  condition  but 
require  different  proofs,  so  do  Theorems  5  and  3.  Theorem  6  establishes  uniqueness  in  a  certain 
sense  for  both  nonsymmetric  and  symmetric  e-scaling.  This  uniqueness  result  is  analogous  to  part 
2  of  Theorem  1  for  the  case  of  exact  scaling  of  a  matrix  having  total  support. 

Finally,  the  stochastic  iterations  (2.25)  and  (2.27)  are  to  our  knowledge  new  and  appear  to  be 
the  first  matrix-free  algorithms  to  scale  a  matrix  at  least  approximately. 


2.2  Existence  and  uniqueness 

One  reason  for  investigating  the  conditions  under  which  a  matrix  is  scalable  is  to  determine  when 
scaling  can  be  used  as  a  preconditioner.  From  this  practical  viewpoint,  approximate  rather  than 
exact  scaling  is  all  that  is  required.  Uniqueness  of  the  scaled  matrix — in  the  case  of  approximate 
scaling,  uniqueness  in  a  certain  meaningful  sense — tells  us  that  a  particular  scaling  algorithm  achieves 
all  that  scaling  permits  by  finding  one  set  of  scaling  vectors:  although  there  may  be  other  sets  of 
scaling  vectors,  there  is  not  some  other,  possibly  better  conditioned,  scaled  matrix. 

In  this  section,  we  prove  Theorems  3  and  6.  The  first  gives  the  necessary  and  sufficient  condition 
for  e-scalability.  The  second  establishes  uniqueness  of  the  approximately  scaled  matrix.  So  that 
the  theory  supporting  Theorem  3  is  fully  contained  in  this  chapter,  we  prove  a  result  about  exact 
scaling-  -Theorem  4,  a  condition  for  existence  and  uniqueness  for  exact  scaling — that  we  shall  use 
in  proving  Theorem  3;  furthermore,  some  of  the  steps  of  the  proof  of  Theorem  4  are  used  in  the 


12 


CHAPTER  2.  EQUILIBRATION  OF  MATRICES 


proof  of  Theorem  10.  The  existence  part  of  Theorem  4  is  weaker  than  Theorem  2 — a  necessary  and 
sufficient  condition  for  existence  of  an  exact  scaling  matrix — and  the  uniqueness  part  is  weaker  than 
other  theorems  in  the  literature,  but  it  is  all  we  need. 

What  follows  is  based  on  B1  not  A,  and  so  is  true  for  all  p-norms;  however,  we  often  phrase  the 
statements  of  the  theorems  in  terms  of  A. 

2.2.1  Exactly  scalable  matrices 

In  this  section,  we  prove  the  following. 

Theorem  4.  If  every  diagonal  element  of  the  symmetric  matrix  A  is  nonzero,  then  A  is  scalable. 
Moreover,  the  scaling  vector  is  unique. 

We  say  x  >  0  is  the  unique  scaling  vector  for  the  matrix  A  if  any  other  scaling  vector  y  >  0  is  related 
to  a;  by  a  scalar  multiple. 

An  immediate  corollary  is  that  every  positive  definite  matrix  A  has  a  unique  scaling  vector.  The 
existence  part  of  the  theorem  is  equivalent  to  Corollary  7.7  of  [9].  The  primary  tool  of  finding  a 
function  for  which  the  scaling  equation  is  a  gradient  is  similar  to  that  used  in  [43].  The  uniqueness 
part  is  weaker  than  Theorem  8.2  of  [9]  and  the  theorem  of  Marshall  and  Olkin  [43]. 

Since  x  >  0,  we  can  multiply  both  sides  of  (2.1)  by  X~1\  Bx  =  X~1e.  Rearranging,  we  recognize 
that 


g(x)  =  Bx  —  X  1e 


is  the  gradient  of  the  function 


(2.2) 


The  Hessian  of  /  is 

H(x)  =  B  +  X~2. 

Lemma  1.  If  An  ^  0  for  each  i,  then  f(x)  has  a  minimizer  x*  >  0. 

Proof.  Let  2/3  =  min.  Bn ,  which  is  positive  by  assumption.  The  first  term  of  /  is  bounded  below  by 
/3||x|||  >  0.  The  second  term  is  bounded  as 


n 


Therefore,  as  ||x||  — >  oo,  /  — >  oo. 
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Let  x  —  mini  Because  the  second  term  of  /  goes  to  oo  as  x  —■ >  0  and  the  first  term  is  bounded 


below  by  0,  /  — >  oo  as  x  0. 


Choose  x  >  0,  e  >  0  and  let  /  =  f(x).  Since  limx^o  /(x)  =  lim||a.|i_>00  f(x)  =  oo,  there  exists  a 
compact  set  f l  containing  x  on  whose  boundary  dfl  f(y)\yedfi  =  />/  +  £•  Since  x  €  0,  /  =  /  —  e, 
and  /  £  C°°,  fi  contains  a  minimizer  x*  of  /  in  its  interior.  Therefore,  f(x)  has  a  minimizer 


x*  >  0. 


n 


Lemma  2.  If  An  ^  0  for  each  i  and  g(x)  =  0,  then  H{x)  is  positive  definite. 

Proof.  If  XHX  is  positive  definite,  then  so  is  H .  Now, 

XHX  =  XBX  +  I. 

If  g(x)  =  0,  then  C  =  XBX  is  doubly  stochastic.  An  ^  0  by  assumption  and  so  Cu  >  0.  By  the 
Gershgorin  theorem  [28] ,  the  eigenvalues  of  C  lie  in  the  union  of  discs  centered  on  each  of  the  Cu .  As 
Cu  >  0  and  the  sum  of  the  elements  in  a  row  is  one,  each  disc  has  radius  less  than  one.  Therefore, 
the  minimum  eigenvalue  Amin(C)  >  —1,  Amin(C  +  I)  >  0,  and  so  Am; n{H)  >0.  □ 

Proof  of  Theorem  f.  Since  g{x)  is  the  gradient  of  f(x),  and  g(x)  =  0  if  and  only  if  x  is  a  scaling 
vector  for  A,  a  minimizer  of  /  is  a  scaling  vector  for  A.  By  Lemma  1,  a  scaling  vector  exists. 

In  fact,  any  stationary  point  of  /  is  a  scaling  vector  for  A.  But  Lemma  2  shows  that  every 
stationary  point  of  /  is  a  strong  minimizer.  /  has  only  one  stationary  point  because  /  €  C°°  exists 
on  the  open  set  x  >  0,  limx^o  fix)  =  lim||x|i_>00  f{x)  =  oo,  and  every  stationary  point  of  /  is  a 
strong  minimizer.  Therefore,  A  has  a  unique  scaling  vector.  □ 

A  matrix  with  at  least  one  zero  diagonal  element  may  be  scalable,  but  the  scaling  vector  may 
not  be  unique.  For  example,  consider 


and  XBx  =  e.  Both  rows  of  A  imply  X\X2  =  1,  and  so  there  are  infinitely  many  scaling  vectors. 

2.2.2  ^-scalable  matrices 

In  this  section,  we  find  a  necessary  and  sufficient  condition  for  both  nonsymmetric  and  symmetric 
e-scaling,  and  we  show  that  the  approximately  scaled  matrix  is  unique  in  a  certain  sense. 
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Examples 


Let  us  look  at  a  few  examples  of  the  symmetric  problem.  Consider  the  structurally  rank  deficient 
matrix 


(l 


A=  1 

V1 


(2.3) 


and  the  scaling  equation  XBx  =  e.  The  second  and  third  rows  imply  X1X2  =  1,  #i#3  =  1; 
substituting  these  into  the  first  row  yields 


x\  +  X\X2  +  #l#3  =  1 

#1  +  1  +  1  —  1 

x\  =  — 1 


Therefore,  XBx  =  e  has  no  real  solution,  but  it  has  two  bounded  imaginary  solutions. 

In  contrast,  consider  the  structurally  (and  numerically)  nonsingular  matrix  A  =  ^  ^  .  The 

second  row  implies  *1X2  =  1;  substituting  this  into  the  first  row  yields 


x\  +  X\X2  =  1 
#1  +  1  =  1 
x\  =  0. 


Therefore,  XBx  =  e  has  no  solution.  However,  suppose  we  set  x\  =  p  and  #2  =  p  1  ■  Then 


XBx  —  e  = 


the  residual  goes  to  zero  as  p  — >  0,  and  so  A  is  e-scalable. 

As  a  more  complicated  example  (and  one  that  is  not  diluted — a  technical  condition  in  [42]  under 
which  a  matrix  is  not  scalable — as  the  previous  example  matrix  is),  consider  the  structurally  and 
numerically  nonsingular  matrix 


A  = 


l1  1 

1  1 

V1 


The  first  and  third  rows  imply  xf  +  #i#2  =  0,  and  so  the  equation  does  not  have  a  positive  solution. 
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However,  if  we  set  x\  =  p,  X2  =  1,  and  X3  =  p  ,  then  the  residual  goes  to  zero  as  p  —>  0. 

Finally,  it  is  clear  that  the  matrix  A  in  (2.3)  is  not  e-scalable.  For  suppose  the  equations  implied 
by  the  second  and  third  rows  are  satisfied  to  a  residual  e.  Then  the  first  equation  is 

1=  x\  +  X1X2  +  X1X3  =  x\  +  (1  +  O(e))  +  (1  +  O(s))  =  x\  +  2  +  0(e)  =>  x\  =  -1  +  O(e). 

For  e  sufficiently  small,  x\  is  negative. 

Results 

Lemma  3.  If  B  is  e-scalable,  then  it  has  support. 

Proof.  Suppose  B  lacks  support  but  is  e-scalable. 

The  structural  rank  of  B  is  the  size  of  the  maximum  matching  of  the  bipartite  graph  induced  by 
the  rows  and  columns.  Hence  if  B  is  structurally  singular,  then  by  Hall’s  Theorem  [34]  it  has  a  set 
of  column  indices  C  such  that  the  matrix  B(:,C)  (using  Matlab  notation)  contains  r  <  \C\  nonzero 
rows.  Let  the  row  indices  of  the  nonzero  rows  be  7 Z;  |72|  =  r. 

If  B  is  e-scalable,  then  there  exist  x,  y  >  0  such  that 

XBy  =  e  +  O(e)  (row  equations) 

Y BT x  =  e  +  O(e)  (column  equations). 

Consider  the  sum  over  the  column  equations  j  €  C: 


BVXiVj  =  \C\  +  °(£ )' 


jec  i 

Each  term  in  the  lhs  of  this  equation  also  appears  in  the  sum  over  the  row  equations  i  £  1Z,  and  so 


(2.4) 


j 


But  1 72. |  <  |C|,  and  so  for  e  sufficiently  small,  we  have  a  contradiction.  Hence  B  is  not  e-scalable.  □ 
Lemma  4.  If  B  has  support,  then  it  is  nonsymmetrically  e-scalable. 

Proof.  By  part  4  of  Theorem  1,  the  Sinkhorn-Knopp  iteration  converges  to  a  doubly  stochastic 


matrix  if  B  has  support.  Let  Ck  be  the  scaled  matrix  at  iteration  k;  C  =  lim^oo  Ck  is  doubly 
stochastic.  Hence  for  every  e  >  0,  there  is  an  iteration  K  such  that  for  all  k  >  K , 


|| XkByk  -  e||  <  e  and  \\YkBxk  -  e||  <  e. 


□ 


Combining  Lemmas  3  and  4  yields  the  following  theorem. 
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Theorem  5.  B  is  nonsymmetrically  e-scalable  if  and  only  if  it  has  support. 

Lemma  5.  If  B  is  e-scalable,  then  so  is  PBQT  for  permutation  matrices  P  and  Q. 

Proof.  XBy  =  e  +  O(e)  =  Pe  +  0[e)  =  ( PXPT)(PBQT)(Qy ),  and  similarly  for  YBTx.  □ 

Proof  of  Theorem  3.  The  necessary  part  of  the  theorem  follows  from  Lemma  3,  for  if  B  is  not 
nonsymmetrically  £-scalable,  then  it  is  not  symmetrically  £-scalable. 

Now  we  prove  the  sufficiency  part.  Suppose  <5  >  0  in  the  equation 

Y{B  +  SI)y  =  e.  (2.5) 

By  Theorem  4,  a  unique  x  >  0  satisfies  this  equation.  By  the  proof  of  Theorem  4,  Xi(S)  is  a 
continuous  function  of  6  because  it  is  the  unique  minimizer  of 

f(x,S)  =  \xT (B  +  SI) x  -  y^lnzj 

i 

for  6  fixed,  and  /  is  a  continuous  function  of  (x,5). 

Suppose  we  use  x(6)  as  an  approximate  scaling  vector  for  B.  Then  the  residual  to  the  scaling 
equation  is 

XBx  -  e  =  -SXx.  (2.6) 

We  shall  show  that  if  B  has  support,  then  lim^o  5||^x|[  =  0;  and  so  for  every  e  >  0,  there  exists  a 
5  >  0  such  that  5||Xx||  <  e. 

Let  T  =  {i  :  lim^o  Xi  =  oo}.  Observe  that 

i,j  el  (possibly  i  =  j)  only  if  B^  =  0,  (2.7) 

for  otherwise  the  term  BijXiXj  would  grow  without  bound. 

Let  Z  =  {i  :  Bij  ^  0  and  j  £  T}.  If  i  £  Z,  then  lim^o^i  =  0,  for  again  otherwise  the  term 
BijXiXj  would  grow  without  bound. 

Let  B  =  {i  :  i  Z}.  If  i  £  B,  then  because  i  ^  X, 

lim  Xi  =  x*  <  oo.  (2.8) 

By  Lemma  5,  we  can  assume  B  is  ordered  such  that  the  first  \T\  rows  are  the  equations  i  £  T 
and  the  next  \Z\  rows  are  the  equations  j  £  Z.  The  blocks  Bxx  and  Bxb  of  B  are  zero.  The  first 
follows  from  (2.7).  The  second  is  true  because  if  Bij  ^  0  for  i  £  X,  then  j  £  Z  by  the  definition  of 
Z.  Therefore,  the  only  nonzero  block  in  the  first  \X\  rows  is  the  middle  one,  B%z-  In  summary,  the 
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block  structure  of  B  is 


B 


Bzi 


Bjz 

Bzz 

Bbz 


\ 

Bzb 

Bbb  J 


Corresponding  to  the  ith  row  of  (2.5)  is  equation  i: 


(Bn  T  8^X^  T  Xi  ^  (  BijXj  1. 

Consider  an  equation  k  £  T.  We  arrange  the  terms  so  that 


xk  E  BkjXj  =  1  -  Sx2k. 
jez 

Summing  the  equations  i  £  Z, 


(2.9) 


(2.10) 


E 

i£Z 


( Bu  +  S)x2  +  x , 


Y,b 


IjUsj 


1^1, 


(2.11) 


where  \Z\  is  the  number  of  elements  in  the  set  Z.  Similarly,  summing  the  equations  k  £  T  in  the 
form  (2.10), 


E Xk  E  B^i  =  ixi  - 6  E  xl-  (2-12) 

fcEX  /e£Z 

Because  B  is  symmetric  and  has  the  block  structure  (2.9),  every  term  in  the  lhs  of  (2.12)  appears 
in  the  lhs  of  (2.11).  Subtracting  (2.12)  from  (2.11), 

E  &(*)*<  =  -  ixi) + 5  E  xl  (2-13) 

i(zZ  fc£Z 


where 


/^i(^)  —  (Bn  “I-  tyXi  “b  Xi  ^  (  BijXj . 

By  (2.8)  and  the  definition  of  Z,  liriig _ >o  >%($)  =  0;  and  by  the  definition  of  Z,  lim^o  Xi  =  0. 

Therefore,  the  lhs  of  (2.13)  converges  to  0. 

Consider  the  rhs  of  (2.13).  There  are  three  cases  to  consider: 

1.  If  \Z\  >  |I|,  then  the  rhs  is  bounded  away  from  0,  which  contradicts  that  the  lhs  converges  to 

0. 
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2.  If  \Z\  <  \T\,  then  B ,  and  so  A  G  Rnxn,  is  structurally  singular,  which  contradicts  our  assump¬ 
tion.  For  if  \Z\  <  \1\,  then  Bxz  is  taller  than  it  is  wide,  and  so  A  is  structurally  singular. 

3.  Consequently,  \Z\  =  \T\1  and  so 

fcez 

as  is  true  of  each  term  separately. 

The  limit  (2.14)  and  the  fact  that  every  Xi  for  i  £  I  is  bounded  above  by  definition  implies  that 
the  norm  of  the  residual  (2.6)  converges  to  0  as  5  — >  0.  □ 

Lemma  6.  If  B  is  structurally  symmetric  and  has  a  positive  main  diagonal,  then  it  has  total  support. 

Proof.  If  Bij  ^  0,  then  a  supporting  permutation  is  er  such  that  <j{i)  =  i  if  i  j,  cr (i)  =  j,  and 
cr(j)  =i.  □ 

Let  ts(-B)  be  such  that  ts (B)ij  =  B^  if  B^  =  0  or  B^  has  a  supporting  diagonal;  hence  ts(-B) 
has  total  support  if  B  has  support.  Let  pat(-B)  be  such  that  pat  {Bij)  =  1  if  and  only  if  Bl:]  ^  0. 

Let  B  be  e-scalable.  An  e-scaling  algorithm  computes  vectors  x(e)  and  y{e)  (possibly  x(e)  =  y{e)) 
for  e  >  0  such  that  the  vectors  satisfy  the  e-scaling  equation. 

Lemma  7.  If  B  is  e-scalable,  then  C  =  lime_,o  X{e)BY{e)  is  doubly  stochastic. 

Proof.  By  the  definition  of  e-scaling,  X(e)By{e)  =  e  +  0(e),  and  so  lim£^o  X{e)By{e) 

O(e)  =  e;  and  similarly  for  Y(e)BTx(£). 

Let  u  =  lime_,o  u(e)  for  various  vectors  u.  In  some  cases,  elements  of  u  are  0  or  oo. 

Theorem  6.  Let  B  be  e-scalable  and  C  =  lim£^o  X(e)BY(e),  where  x{e)  and  y(e) 
by  any  e-scaling  algorithm.  C  is  the  unique  doubly  stochastic  matrix  to  which  ts(13) 
equivalent. 

Proof.  1.  First,  C  has  total  support  by  Lemma  7  and  the  fact  that  a  doubly  stochastic  matrix  has 
total  support. 

2.  Second,  scaling  cannot  introduce  a  nonzero  Cjj  when  B^  =  0;  hence  if  B^  lacks  support,  then 
Cij  =  0.  We  shall  show  that  if  B^  ^  0  and  has  support,  then  Cij  ^  0.  These  two  statements  imply 

pat(C)  =  pat(ts(B)).  (2-15) 

3.  By  Lemma  5,  we  can  assume  that  B  and  C  are  permuted  such  that 


=  lim£^0  e  + 
□ 

are  produced 
is  diagonally 


x{e)j  G  0(x(e)i)  and  y(e)j  G  Ll{y(e)i)  for  j  >  i; 


(2.16) 
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that  is,  asymptotically  x{e)j  increases  no  faster  than  x(e)i  and  y(e)j  decreases  no  faster  than  y{e)i- 

Consider  a  particular  product  Xij  =  lime_,0  xi£)iy{£)j-  If  Xij  =  oo,  then  B^  =  0,  for  otherwise 
Cij  =  oo.  There  must  be  at  least  one  pair  (i,  j)  in  every  row  and  every  column  such  that  0  <  Xij  <  oo , 
for  otherwise  C  would  lack  support. 

Suppose  there  is  a  pair  (i,j)  such  that  Xij  =  0.  By  (2.16),  Xkm  =  0  for  all  pairs  (fc,m)  such  that 
k  >  i  and  m  <  j.  Similarly,  suppose  there  is  a  pair  (i.  j)  such  that  Xij  =  oo;  then  Xkm  =  oo  for  all 
pairs  ( k ,  m)  such  that  k  <  i  and  m  >  j. 

Given  a  pair  (k,  m)  such  that  Xkm  =  0,  we  can  find  a  pair  ( i,j )  such  that  Xij  =  0,  X(i-i)j  7^  0, 
and  Xi(j+ 1)  7^  0)  f°r  otherwise  C  would  have  a  zero  column  or  row  and  so  lack  support.  As 
Xij  =  0  but  X(i-i)j  7^  0,  x(e)i  ^  fi(a;(e)j_i);  and  x(e)i  £  0(x(e)j_  1)  by  (2.16).  Furthermore, 
x(e)i- 1  £  ©(^(e)^1)  and  so  0  <  X(i-i)j  <  00,  for  otherwise  C  would  have  a  zero  column  and  so 
lack  support.  Similarly,  y(e)j+ 1  ^  0(y(e)j),  y{e)j+ 1  £  0.{y{e)j),  and  y{e)~h  £  0(x(e)i).  Hence 
X(i-i)(j+i)  =  00,  and  so  x  has  tl16  block  structure 


and  so  C  has  the  block  structure 


As  C  has  total  support,  each  nonzero  block  has  total  support  and  so  is  square.  Hence  A  =  C  and 


C  =  V. 


Now  suppose  Cij  =  0  but  ^  0  and  has  support.  The  pair  (i.  j)  must  be  in  the  (2, 1)  block 

of  x ■  Consider  a  permutation  a  that  supports  Bij.  As  <r(i)  =  j,  there  is  a  row  a  £  A  such  that 
cr(a)  £  B,  and  so  Baa^  ^  0.  But  Xaa(a)  =  00.  That  Baa{a)  ^  0  and  Xaa(a)  =  00  contradicts  that 
Coa(a)  =  0.  Equation  (2.15)  follows. 

4.  Next,  we  show  that  ts (B)  is  diagonally  equivalent  to  C.  Consider  a  pair  (i,j)  such  that  C\j  ^  0. 
As  0  <  C^  <  00,  0  <  lim£^0  xi{£)Vj{£)  <  Hence  there  exists  a  function  /(e)  such  that 
Xi(e),  yj{e)~l  £  0(/(e)).  If  Cik  ^  0  and  k  ±  j,  yfc(e)_1  £  0(/(e)),  for  otherwise  0  <  Cik  <  00 
does  not  hold;  and  similarly  for  Ckj  and  k  ^  i.  Hence  there  are  index  sets  A4  and  J\f  such  that 
Xi{e)  £  0(/(e))  for  i  £  M  and  yj{e)~1  £  0(/(e))  for  j  £  A f.  Let  Xi{e)  =  /(e)_1Xj(e)  for  i  £  M 
and  yj(e)  =  f{e)yj(e)  for  j  £  M.  Then  0  <  Xi  <  00  and  0  <  ijj  <  00.  Furthermore,  Xi(e)yj(e)  = 
f(E)~1Xi(E)f(E)yj(£)  =  Xi(e)yj{e),  and  so  BijXiyj  =  Cij. 

We  can  repeat  this  process  for  all  pairs  (i,j)  such  that  C.l?  ^  0  and  Xi,  ijj  have  yet  to  be  defined. 
Once  Xi,  ijj  are  dehned  for  every  i  and  j,  we  have  scaling  vectors  x,  y  such  that,  recalling  (2.15), 
Xts  {B)Y  =  C. 
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5.  Finally,  by  part  2  of  Theorem  1,  C  is  the  unique  doubly  stochastic  matrix  to  which  ts (B)  is 
diagonally  equivalent.  jjj 


2.2.3  Symmetric  positive  definite  matrices 

Although  we  largely  omit  a  discussion  of  conditioning  results,  the  case  of  spd  matrices  will  be 
important  to  us  when  we  consider  quasi-Newton  methods. 

We  call  symmetric  scaling  of  A  such  that  the  result  has  all  unit  diagonal  elements  Jacobi  scaling. 
If  A  is  spd,  then  Jacobi  scaling  is  always  possible,  as  every  diagonal  element  is  positive. 

Van  der  Sluis  [63]  showed  the  following.  Let  A  be  an  n  x  n  Hermitian  positive  definite  matrix. 
Suppose  all  the  main  diagonal  elements  are  1.  Let  k(-)  denote  the  condition  number  of  a  matrix. 
Then  k(A)  <  riming  «(diag(d) A diag(d))  (Theorem  4.1  of  [63]).  Additionally,  if  A  has  at  most  q 
nonzero  elements  in  any  row,  then  k(A)  <  gmin^  K(diag(d)Adiag(d))  (Theorem  4.3  of  [63]). 

Quite  a  bit  earlier,  Forsthye  and  Straus  [21]  proved  the  following  tighter  result  for  a  class  of 
Hermitian  p.d.  matrices.  A  matrix  B  has  Young’s  property  A  if  there  exists  a  permutation  matrix 
P  such  that 


PBpT=(Dl  Bl 
\  Bo  Do 


where  D\  and  Z)2  are  square  diagonal  matrices.  If  a  Hermitian  p.d.  matrix  A  has  Young’s  property 
A,  then  k(A)  <  min^  Ac(diag(d)Adiag(d))  (Theorem  4  of  [21]). 

In  summary,  in  these  three  theorems  Jacobi  scaling  is  within  a  factor  of  n,  q ,  or  1  of  optimal 
among  all  diagonal  scaling  matrices.  Since  Jacobi  scaling  is  evidently  quite  effective  on  spd  matrices, 
we  should  investigate  by  how  much  equilibration  differs  from  Jacobi  scaling. 

If  A  is  spd,  then  so  is  B  =  A  o  A  by  the  Schur  Product  Theorem  (see,  for  example,  Theorem 
7.5.3  of  [28]).  Suppose  A  has  unit  diagonal  elements.  Then  so  does  B.  Moreover,  if  i  A  j,  then 
Bij  <  1.  For  suppose  >  1.  Let  v  be  the  vector  such  that  Uj  =  1,  Vj  =  —1,  and  Vk  =  0  for  all 
other  elements.  Then  vTBv  =  2  —  2 Bij  <  0,  which  contradicts  that  B  is  spd. 

Suppose  Jacobi  scaling  has  been  applied  to  an  n  x  n  symmetric  matrix  A  to  yield  a  matrix  A, 
and  again  let  B  =  Ao  A.  If  A  is  indefinite  and  has  a  zero  diagonal  element,  we  set  the  corresponding 
element  in  the  scaling  vector  to  1.  Consider  the  vector  of  row  sums  b  =  Be.  If  A  is  indefinite, 
0  <  bi  <  oo.  If  A  is  p.d.,  as  every  diagonal  element  of  B  is  1,  bi  >  1;  and  as  every  off-diagonal 
element  B^  <  1,  bi  <  n. 

Let  var(u)  be  the  variance  of  an  n-vector  v :  var(u)  =  n^1  ybfa,  —  p(v))2,  where  p  is  the  mean 
of  the  elements.  If  a  matrix  is  binormalized,  then  the  variance  of  the  vector  of  its  row  norms  is  0.  If 
A  is  indefinite,  var(JJe)  can  be  arbitrarily  large.  But  if  A  is  p.d.,  we  have  the  following  result. 
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Theorem  7.  If  A  is  spd  and  has  all  unit  diagonal  elements ,  then  var(Be)  <  (n  —  l)2. 

Proof.  As  each  1  <  bi  =  ( Be)i  <  n,  ( bi  —  /r(5))2  <  {n  —  l)2.  Therefore,  n_1  —  p(b))2  < 

n~l  J2i(n  ~  l)2  =  (n  -  !)2-  □ 

From  the  other  direction,  an  immediate  corollary  of  some  results  in  [41]  is  that  if  an  spd  matrix  A 
is  equilibrated  in  the  1-norm  to  form  A ,  then  n-1  <  An  <  1  (the  upper  bound  follows  immediately 
from  equilibration  to  unit  row  and  column  1-norms);  if  A  is  indefinite,  of  course,  then  —  1  <  An  <  1. 

Theorem  7  shows  that  when  A  is  spd,  Jacobi  scaling  produces  a  matrix  that  is  not  arbitrarily  far 
from  being  binormalized.  This  observation  suggests  that  the  two  methods  of  scaling  condition  spd 
matrices  comparably.  Numerical  experiments  in  Section  2.4  compare  the  conditioning  each  method 
yields  on  a  set  of  spd  matrices;  the  results  are  indeed  quite  similar. 

2.3  Matrix-free  approximate  symmetric  equilibration 

Exact  scaling  algorithms  for  general  matrices  require  access  to  the  matrix  elements.  We  develop 
approximate  algorithms  that  access  the  matrix  only  through  matrix-vector  products. 

Our  analysis  in  Section  2.2  shows  that  every  structurally  nonsingular  matrix  can  be  £-scaled.  The 
degree  of  approximation  is  measured  by  e,  which  quantifies  how  much  the  scaled  matrix  deviates 
from  being  doubly  stochastic.  In  a  number  of  algorithms  [42,  57,  62,  36],  one  can  measure  this  error 
at  each  iteration  and  terminate  when  it  falls  below  e. 

As  we  focus  on  the  problem  in  which  one  does  not  have  direct  access  to  the  elements  of  the  matrix, 
our  algorithms  cannot  monitor  the  error.  In  the  absence  of  any  error  information,  our  algorithms 
run  for  the  number  of  iterations  specified  by  the  user  and  then  terminate. 

2.3.1  The  symmetric  Sinkhorn-Knopp  iteration 

Consider  a  nonnegative  n  x  n  matrix  B.  Let  r  and  c  be  positive  n- vectors.  Recall  our  convention 
that  r~x  is  the  element-wise  reciprocal  of  r.  The  Sinkhorn-Knopp  iteration  is  as  follows: 

rk+1  =  (Bck)~l 
ck+1  =  (BTrk+1)~1. 

By  Theorem  1,  if  B  has  total  support,  then  the  limits  r  =  lim^—too  rk  and  c  =  lim^oo  cr  exist 
and  RBC  is  doubly  stochastic;  and  if  B  is  fully  indecomposable,  r  and  c  are  unique  up  to  a  scalar 
multiple. 

The  symmetric  Sinkhorn-Knopp  iteration  is 


xk+1  =  (Bx1*)-1. 
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In  this  iteration,  x  alternately  takes  the  role  of  r  and  c  in  the  nonsymmetric  iteration:  in  particular, 
x2k  =  rk  and  x2k+1  =  ck.  Knight  [35]  discusses  similar  ideas.  We  shall  see  that  because  of  the 
simple  relationship  between  x ,  r,  and  c,  in  general  xk  oscillates  between  two  sequences. 

A  matrix  A  is  reducible  if  there  exists  a  permutation  matrix  P  such  that 


PAP1  = 


where  B  and  D  are  square.  Otherwise  A  is  irreducible.  Observe  that  if  A  is  symmetric,  then  PAPT 
is  block  diagonal  and  symmetric. 


Theorem  8.  If  the  symmetric  nonnegative  matrix  B  is  irreducible  and  fidly  indecomposable,  then 
the  symmetric  Sinkhorn-Knopp  iteration  converges  to  x  such  that  aXBX  is  doubly  stochastic  for 
all  positive  starting  vectors,  where  a  >  0  is  a  scalar. 


Proof.  If  B  is  fully  indecomposable,  then  B  has  total  support  [8]. 

By  part  4  of  Theorem  1,  as  I?  has  support,  the  nonsymmetric  Sinkhorn-Knopp  iteration  converges 
for  all  positive  starting  vectors.  By  part  3  of  the  same  theorem,  if  B  is  fully  indecomposable,  the 
scaling  matrices  D\  and  Di  are  unique  up  to  a  scalar  multiple.  As  both  D1BD2  and  ( DiBD2)T  are 
doubly  stochastic  and  D\  and  D2  are  unique  up  to  a  scalar  multiple,  d\  oc  c?2-  Therefore,  D\BDi 
and  D2BD2  are  scalar  multiples  of  each  other  and  of  a  doubly  stochastic  matrix. 

In  the  nonsymmetric  Sinkhorn-Knopp  iteration,  let  us  associate  the  following  vectors:  x2k  =  rk 
and  x2k+1  =  ck.  As  the  nonsymmetric  iteration  converges,  let  r  =  linifc^oo  rk  and  c  =  limfc-xx,  ck .  By 
our  previous  comments,  r  oc  c  and  RBC  is  doubly  stochastic.  Let  r  =  (3c.  Then  RBR  =  (3RBC  and 
CBC  =  P~1CBC.  If  the  iteration  terminates  for  k  odd,  the  constant  a  =  /?;  if  even,  a  =  (3~1 .  □ 


The  theorem  does  not  hold  if  B  is  reducible.  Consider  the  matrix  B  =  diag(l  2)T.  If  x°  =  e, 
the  even  iterates  remain  e  while  the  odd  iterates  immediately  converge  to  v  =  (1  1/2)T.  Of  course 
IBV  equilibrates  B.  The  symmetric  equilibration  vector  is  (1  l/\/2)T  =  yfev. 

That  the  symmetric  Sinkhorn-Knopp  may  not  converge  on  reducible  matrices  is  not  a  problem. 
The  blocks  of  the  reducible  matrix  B  decouple  the  problem,  and  one  can  construct  a  symmet¬ 
ric  equilibrating  vector  x  from  the  nonsymmetric  equilibrating  vectors  r  and  c.  Suppose  r  and  c 
equilibrate  B  by  RBC.  Let  q  =  c/r,  and  let  I  be  the  indices  of  a  subset  of  identical  elements 
of  q.  If  RBC  is  doubly  stochastic,  so  is  ( aR)B(a~1C )  for  a  scalar  a.  Set  a  =  \fcjr  so  that 
ar(T)  =  a_1c(I)  =  yJr(I)c(I)  =  x{l).  Repeating  this  procedure  for  each  such  subset  of  indices 
yields  the  symmetric  equilibration  vector  x. 

Our  algorithms  are  motivated  by  the  symmetric  Sinkhorn-Knopp  iteration.  They  average  iterates 
and  so  do  not  exhibit  the  oscillation  we  have  just  discussed. 
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2.3.2  Stochastic  binormalization 

The  symmetric  Sinkhorn-Knopp  algorithm  uses  the  matrix-vector  product  Bx.  If  A  is  a  general 
symmetric  matrix,  then  Bij  =  \Aij \p  for  p  >  1,  and  so  B  is  not  available  if  one  does  not  have  access 
to  the  elements  of  A.  How  can  we  compute  Bx,  at  least  approximately,  by  using  a  matrix-vector 
product  with  A  rather  than  B7 

Let  a  be  a  vector,  possibly  random.  Suppose  a  and  the  random  vector  u  are  independent. 
Lemma  8.  If  the  elements  ofuG  1"  have  zero  mean,  positive  and  finite  variance,  and  are  iid,  then 

E(aTu)2  =r)EaTa.  (2.17) 


for  finite  r]  >  0. 

Proof.  Because  EuiUj  =  0  if  i  7^  j,  E  ajuj')  =  E  Ej  =  vJ2j  a %  where  77  =  Eu2  >  0  is 
finite.  □ 


For  use  in  Chapter  3,  let  /  be  the  pdf  of  each  element  of  u.  Suppose  /  is  symmetric  around  zero. 
Then  normalization  is  permitted,  as  demonstrated  in  the  following  lemma. 

Lemma  9.  If  the  elements  of  u  G  M™  have  positive  and  finite  variance  and  are  iid  with  pdf  f ,  and 
f  is  symmetric  around  zero,  then 


E 


( aTu )2 


=  n  1EaTa. 


Proof.  A  cross  term  now  has  the  form 


E 


U1U2 


(2.18) 


We  evaluate  the  integral  as  follows: 


E 


uiu2 


rid'b 

3>  1 


The  integrand  in  the  inner  integral  is  antisymmetric  around  zero  and  so  the  inner  integral  is  zero. 
Therefore,  (2.18)  is  also  zero. 

As  the  cross  terms  are  again  zero, 


E 


(Ej  ajUi) 


E ,  «; 


=  E 


2  2 
J  a3Ui 


E 

Ej  u2 


=  riEJ2au 


where  r]  =  E 


E 


2  ’ 
3U3 
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and  as 


nV  —  E 


E 


u2 

J  aJ 


=  E 


E, 


E,- 


=  1, 


we  have  p  =  n 


□ 


A  vector  u  whose  elements  are  iid  normal  random  variables  obeys  the  conditions  of  Lemma  9.  Such 
a  vector  samples  uniformly  from  the  unit  sphere  [47] . 

If  a  is  complex  valued,  both  Lemmas  8  and  9  generalize  simply  by  considering  ( aTu)(aTu )  rather 
than  ( aTu )2.  We  present  our  algorithms  using  notation  suitable  only  for  real- valued  matrices,  but 
they  immediately  extend  to  complex-valued  matrices. 

Returning  to  our  problem,  we  can  approximate  Bx  when  p  =  2  by  computing  the  matrix-vector 
product  AX1t2u,  where  u  obeys  the  assumptions  of  either  Lemmas  8  or  9.  By  Lemma  8,  for  example, 


E  (AX1/2u)2  =  p(AX1/2)2e  =  rj(A  o  A)Xe  =  rjBx. 


(2.19) 


To  increase  the  accuracy  of  the  estimate,  one  could  compute  the  mean  of  multiple  matrix-vector 
products  AX'  !2u.  Then  one  could  construct  an  approximate  scaling  algorithm  by  replacing  the 
exact  computation  Bx  with  this  estimate  in  the  symmetric  Sinkhorn-Knopp  algorithm. 

However,  the  methods  of  stochastic  approximation  suggest  a  better  approach.  In  stochastic 
approximation,  the  exact  iteration 

xk+1  =xk+tokf(xk)  (2.20) 


is  replaced  by  the  stochastic  iteration 
xk+1=xk+u >kf(xk), 


where  E  /( xk)  =  f(xk )  [38] .  If  several  conditions  are  satisfied — some  of  which  are  not  straightforward 
to  verify  —then  the  second  iteration  converges  with  probability  1  to  a  limit  point  of  the  first. 

Recall  that  the  symmetric  Sinkhorn-Knopp  iteration  is  x+  =  ( Bx )-1.  A  scalar  does  not  matter, 
and  so  we  can  use  the  iteration  x+  =  ||Ra;||]"1(I3x)_1.  Let  us  rewrite  the  latter  as  follows.  First, 
d  =  x~x.  Second,  let  u  =  1.  Then 


,,  Bx  d 

d  —  II  D  II  —  (1  ~  - 

ll-B*lli  IMIli 

If  to  ^  1,  we  have  the  new  iteration 


dk+1  =  (  1-W)— 


Bx 

PEE 


\\dk\\i  ll-Sa;fc||i' 


(2.21) 
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The  iteration  takes  a  convex  combination  of  the  current  d  and  the  Sinkhorn-Knopp  update  of  d 
when  each  of  the  two  terms  is  normalized  by  its  1-norm  and  0  <  u>  <  1.  Furthermore,  dk  in  (2.21) 
is  related  to  dk  in  the  iteration 

dk+ 1  =  dk  +  w  (~dk  +  Hx'A 


by  a  scalar,  and  the  latter  iteration  has  the  form  of  (2.20)  if  u>  is  replaced  by  the  sequence  uik. 
We  shall  consider  the  stochastic  iteration  corresponding  to  (2.21)  after  we  discuss  some  convergence 
properties  of  the  latter. 

Sinkhorn  and  Knopp  [62]  proved  that  their  iteration  converges  globally  if  the  matrix  has  support; 
Parlett  and  Landis  [57]  generalized  the  global  convergence  proof  to  all  iterations  that  satisfy  certain 
conditions;  and  Theorem  8  proves  the  symmetric  Sinkhorn-Knopp  iteration  converges  globally  if  the 
symmetric  matrix  B  is  irreducible  and  fully  indecomposable.  It  appears  that  none  of  the  methods 
of  proof  can  be  extended  to  show  the  global  convergence  of  the  iteration  (2.21)  for  two  principal 
reasons:  our  iteration  is  for  the  symmetric  problem,  and  nonsymmetric  scaling  is  essential  to  some 
of  the  steps  of  the  classic  proofs;  and  the  additional  term  (1  —u>)d  means  row  or  column  stochasticity 
is  not  alternately  achieved.  However,  we  have  the  following  partial  results. 

Lemma  10.  The  iteration  (2.21),  with  0  <  to  <  1,  has  a  fixed  point  d*  >  0  if  and  only  if  B  has 
total  support. 

Proof.  Suppose  the  iteration  (2.21)  has  a  fixed  point  d*: 


d* 


G(d*)  =  (1  -  u) 


d* 

Pik 


+  UJ 


Bx* 


Let  s*  =  HBxlk  1 .  d*  is  a  convex  combination  of  two  unit-l-norm  vectors  and  so  itself  has  unit 
1-norm.  Hence 


d*  =  (1  -co)d*  +us*Bx* 

s*X*Bx*  =  e,  (2.22) 

and  so  d*  is  a  scaling  vector.  As  B  is  scalable,  it  has  total  support. 

From  the  other  direction,  by  Theorem  2,  x  >  0  exists  such  that  XBx  =  ce,  c  >  0,  and  ||d||i  =  1, 
if  B  has  total  support.  Then  Bx  =  cd.  Substituting  this  expression  into  (2.21)  shows  that  d  is  a 
fixed  point.  Qi 

Theorem  9.  If  A  is  symmetric  and  has  total  support,  then  a  fixed  point  d*  >  0  of  (2.21)  is  a  point 
of  attraction  for  0  <  u>  <  1. 

We  shall  require  some  supporting  results. 
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Lemma  11.  Let  A  £  Mnxn  be  symmetric,  and  let  {Ai,Ui}  be  its  eigenvalues  and  eigenvectors.  For 
a  £  R",  the  eigenvalues  of  A  —  v\aT  are  {Ai  —  aTv i,  A2, . . . ,  A„}. 

Proof.  A  is  diagonalized  as  A  =  V AVT .  A  —  v\aT  is  similar  to 

VT(A  —  V\  aT)V  =  A  —  ei  bT , 

where  b  =  VT a.  The  matrix  on  the  rhs  is  upper  triangular  and  so  its  eigenvalues  lie  on  its  diagonal. 
Only  the  first  eigenvalue  Ai  is  altered:  it  is  Ai  minus  the  first  element  of  b ,  which  is  just  aTV\.  □ 


Theorem  10.1.3  of  [55].  Suppose  that  G  :  D  C  R”  — >  R"  has  a  fixed  point  x*  in  the  interior  of  D 
and  is  Frechet- differentiable  at  x* .  If  the  spectral  radius  of  G'(x*)  satisfies  p(G'(x*))  =  a  <1,  then 
x*  is  a  point  of  attraction  of  the  iteration  xk+1  =  G{xk). 

Lemma  12.  The  iteration  (2.21)  is  invariant  to  symmetric  permutations. 

Proof.  Let  P  be  a  permutation  matrix.  Then 


IMIli  =  ||Pd||i 

||l?a;||1  =  ||(P13PT)(Pa;)||1. 

Hence  if  dk  and  dk+1  satisfy  (2.21),  then 


||™*||i  ||(PBPP(Pi*)||1’ 

and  so  Pdf  and  Pdk+1  satisfy  (2.21)  for  the  matrix  PBPT . 


□ 


Proof  of  Theorem  9.  First  we  assume  A  is  irreducible. 

By  Theorem  10.1.3  of  [55],  the  theorem  follows  for  irreducible  A  if  p{S7 dG(d)\d=d.*)  <  1-  The 
Jacobian  is 


VdG(d)  =  VxG{d)Vdx  =  -VxG(d)X2 


1  —  UJ  T  1  —  LU  -  T 

I - ,,  ,MO  de  — 


UJ 


IMII 


Ml 


\\Bx\\ 


BX2 


UJ 


\\Bx\\\ 


Bxe1  BX2 


By  Lemma  10,  a  fixed  point  x  =  x*  >  0  is  a  scaling  vector.  At  such  a  point,  Bx  =  s  1d  by  (2.22) 
and  ||d||i  =  1.  Hence 


VdG(d)  |d=d. 


(1  —  u>)I  —  (1  —  oj)deT  —  usBX 2  +  tos2BxeTBX2. 
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Noting  that  Xd  =  e,  we  use  X  to  obtain  the  similar  matrix 

XX dG(d)\d=d*X~l  =  (1  -  w)I  -  (1  -  ui)edT  +  ujsXBX  -  us2XBxeTBX.  (2.23) 

As  XBx  =  s~1e  and  eT  =  dTX,  XBxeT BX  =  s~1eeT  BX  =  s~1edT  XBX ,  and  this  is  substituted 
into  the  fourth  term  of  (2.23).  Combining  the  second  and  fourth  terms,  we  obtain  the  rank-one 
matrix 

Q  =  e[{w  —  l)dT  —  ujsdT  XBX], 

which  has  the  nonzero  eigenvalue  \{u>  —  l)dT  —  cosdT XBX]e.  As  sdT XBXe  =  seT Bx  =  seTs~1d  = 
eTd,  the  eigenvalue  is  (u>  —  1  )dTe  —  coeTd  =  —eTd  =  —  ||d||i  =  — 1. 

Because  sXBx  =  e  and  B  is  symmetric,  C  =  sXBX  is  doubly  stochastic.  First,  p{C)  <  1. 
Second,  we  already  showed  in  the  proof  of  Lemma  2  that  Amin(C')  >  — 1.  Since  C  is  symmetric,  the 
only  eigenvalue  having  unit  magnitude  is  1.  As  for  the  moment  we  assume  A  is  irreducible,  by  the 
Perron-Frobenius  theorem  (see,  for  example,  Theorem  8.4.4  in  [28]),  the  eigenvalue  1  is  simple. 

Next,  the  rank-one  matrix  Q  has  the  right  eigenvector  e.  e  is  also  a  right  eigenvector  of  C, 
associated  with  the  eigenvalue  1.  Therefore,  by  Lemma  11,  C  =  C+Q  has  the  same  eigenvalues  as  C, 
except  that  the  single  eigenvalue  1  is  now  0.  We  conclude  that  p{C)  <  1;  therefore,  p{{l—w)I+wC)  < 
1  as  well,  and  so  p(X dG(d)\d=d*)  <  1,  as  desired. 

Now  suppose  A  is  reducible.  By  Lemma  12,  we  can  assume  A  is  block  diagonal  and  each  block  is 
irreducible.  Our  analysis  to  this  point  applies  to  each  block  separately.  As  the  Jacobian  associated 
with  each  block  has  spectral  radius  less  than  1,  so  does  the  Jacobian  as  a  whole.  □ 

Additionally,  we  can  show  global  convergence  for  the  simpler  iteration 

dk+ 1  =dk  +  ak(—dk  +  Bxk )  =dk  +  akh(dk),  (2.24) 

where  ak  is  an  appropriate  step  size.  One  way  this  iteration  arises  is  as  follows.  Suppose  we  solve  the 
nonlinear  equation  f(d)  =  d—Bx  =  0  using  Newton’s  method.  The  Jacobian  is  J  =  fd(d )  =  I+BX 2, 
and  so  Newton’s  method  yields  the  iteration 

(I  +  BX2)(dk+1  -  dk )  =  -dk  +  Bxk. 

If  we  approximate  the  Jacobian  by  J  ss  /  and  introduce  a  step  size,  we  obtain  (2.24).  However,  we 
treat  the  latter  as  a  minimization  method  so  that  we  can  use  certain  techniques  in  our  proof. 

We  use  the  following  theorem. 


Propositions  1.2.1  and  1.2.2  of  [2].  Let  {d^1}  be  a  sequence  generated  by  a  gradient  method 
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dk+1  =  dk  +  akhk ,  and  assume  that  {hk}  is  gradient  related  and  ak  is  chosen  by  the  minimization 
rule,  the  limited  minimization  rule,  the  Goldstein  rule,  or  the  Armijo  rule.  Then  every  limit  point 
of  {dk}  is  a  stationary  point. 

Various  conditions  assure  {gk}  is  gradient  related;  we  use  the  following:  hk  =  —HkV f(dk),  and  the 
eigenvalues  of  the  positive  definite  symmetric  matrix  Hk  are  bounded  above  and  below:  Ami n(Hk)  > 
ci  and  A max(#fc)  <  c2. 

Theorem  10.  If  A  is  symmetric  and  An  ^  0,  then  the  iteration  (2.24)  converges  globally  if  ak  is 
chosen  according  to  one  of  the  step  size  rules  in  Propositions  1.2.1  or  1.2.2  of  [2]. 

Proof.  Lemmas  1  and  2  show  that  the  global  minimizer  of  the  function  /(x)  defined  in  (2.2)  is  a 
scaling  vector.  To  review,  we  define 

f(x)  =  ^xtBx  -  ^2  1  nxj 

i 

g{x)  =  Vx/(x)  =  Bx-d 
H(x)  =  V2J(x)  =  B  +  D2. 

f  and  its  derivatives  are  defined  with  respect  to  x  rather  than  d;  in  contrast,  the  iteration  (2.24)  is 
defined  with  respect  to  d.  f  and  its  derivatives  are  transformed  as  follows: 

fd(d)  =  ^xtBx  -  In  Xi  =  f(x) 

i 

gd(d)  =  Vdf(d)  =  -X2g(x)  =  -X2Bx  +  x 
Hd(d)  =  V2J(d)  =  2X3diag  (g{x))  +  X2H(x)X2. 

If  H{x)  is  positive  definite  (as  Lemma  2  shows),  then  so  is  Hd(d):  X2H(x)X2  y  0  because  a 
p.d.  matrix  remains  p.d.  if  pre-  and  post-multiplied  by  symmetric  (in  this  case,  diagonal)  matrices; 
and  2X3diag(g(x))  is  a  p.d.  diagonal  matrix. 

In  the  final  paragraph  of  the  proof  to  Lemma  1,  we  observe  that  given  x  >  0,  there  exists  a 
compact  set  LI  containing  x.  As  the  minimizer  of  /  lies  in  LI,  the  iteration  (2.24)  has  a  limit  point. 
In  the  same  proof  we  observe  that  /  — >  oo  as  any  x,  — >  0  or  Xj  — >  oo.  Hence  LI  C  {x  :  x  >  0}.  Let  x 
be  the  initial  point  of  the  iteration.  Let  us  define  the  following  quantities  relative  to  the  boundary 
of  LI: 


dL  =  inf  min  di  and  djj  =  sup  max  di . 
xedfi  i  xedn  i 

As  any  step  size  rule  in  Propositions  1.2.1  and  1.2.2  of  [2]  is  such  that  fk+1  <  fk ,  the  iterates 
remain  in  LI.  Hence  for  each  iterate  d  and  element  dj,  d^  <  d;  <  djj  and  so  the  eigenvalues 
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of  the  matrix  D 2  are  bounded  from  below  and  above  by  d\  and  dfj  respectively.  Therefore,  as 
—D2gd(d)  =  g(x )  =  Bx  —  d  =  h(d),  h{d)  is  gradient  related.  □ 


Substituting  (2.19)  into  (2.21)  and  replacing  w  with  uik ,  we  obtain  the  stochastic  iteration 


yk  =  ( A{Xk)1/2uk f 
dk+1  =  (1  + 


UJ 


k  v 


m  \\ykh' 

Based  on  numerical  experiments,  we  choose  cok  according  to  the  rule 
-log2wfe  =  max(min(|k)g2(fc)J  -  1,4),  1). 


(2.25) 


(2.26) 


Consequently,  1/16  <uik<  1/2  for  all  k,  and  uik  decreases  by  a  factor  of  2  at  particular  values  of  k. 
This  sequence  for  uik  allows  for  large  changes  in  d/||d||i  when  k  is  small  and  smaller  changes  when  k 
is  large.  Our  numerical  experiments  show  that  the  maximum  number  of  iterations  is  a  small  fraction 
of  the  size  of  the  matrix;  otherwise,  the  algorithm  would  not  be  useful. 

The  iteration  (2.25)  is  straightforwardly  generalized  to  the  nonsymmetric  problem.  Let  p  and  7 
be  such  that  pi  =  r”1  and  7 j  =  c/1.  Let  v  be  a  second  random  vector.  Each  iteration  now  performs 
a  forward  and  a  transpose  matrix- vector  product: 


yk  =  ( A(Ck)1/2uk )2 
-k+1  =  {l-cok)^— 


■  L 0 


k  y 


nil  \\yk\\i 

ZK  =  (Ar(i?fc)1/V)2 


7 


=  (i-wfe) 


b* 


+  UJK 


kfcnr 


(2.27) 


We  omit  analysis  of  both  the  corresponding  deterministic  and  this  iteration  because  our  results  would 
again  be  at  best  partial.  But  we  show  the  results  of  extensive  numerical  experiments  in  Section  2.4. 
For  the  problem  of  square  nonsymmetric  matrices,  Theorems  5  and  6  completely  characterize  the 
class  of  e-scalable  matrices;  but  we  have  not  investigated  the  existence  and  uniqueness  theory  for 
rectangular  matrices. 

Using  the  convergence  theory  of  stochastic  approximation  (see,  e.g.,  Chapters  4-6  of  [38]),  it  may 
be  possible  to  show  that  if  a/’  is  a  sequence  obeying  certain  conditions  (in  particular,  lim^oo  u>k  =  0, 
and  so  the  sequence  (2.26)  does  not  conform),  then  the  sequences  (2.25)  and  (2.27)  converge  with 
probability  1  to  scaling  vectors.  We  have  not  succeeded  in  this  analysis  for  these  or  similar  iterations; 
the  analysis  is  quite  technical  and  takes  us  far  afield.  In  any  case,  our  algorithm  is  practical  only 
if  a  small  number  of  iterations  succeeds  in  conditioning  the  matrix  sufficiently,  and  so  theoretical 


30 


CHAPTER  2.  EQUILIBRATION  OF  MATRICES 


function  x  =  ssbin  (A,  nmv,  n) 

%  Stochastic  matrix— fr  e  e  binormalization  for  symmetric  real  A. 

%  x  =  s s  bin  (A,  nmv ,  n ) 

%  A  is  a  symmetric  real  matrix  or  function  handle.  If  it  is  a 
%  function  handle  ,  then  v  =  A(x)  returns  A*x. 

%  nmv  is  the  number  of  matrix— v ect or  products  to  perform. 

%  [n]  is  the  size  of  the  matrix.  It  is  necessary  to  specify  n 

%  only  if  A  is  a  function  handle. 

%  diag(x)  A  diag(x)  is  approximately  binormalized . 
op  =  isa(A,  ’function_handle’); 
if(~op)  n  =  size(A,l);  end 
d  =  ones (n , 1 ) ; 
for  ( k  =  1 : nmv) 
u  =  randn(n , 1 )  ; 
s  =  u . / sqrt (d ) ; 

if(op)  y  =  A(s);  else  y  =  A*s;  end 

omega  =  2/'(— max(min(  floor  ( log2  (k) )  —  1  ,4)  ,  1 ) ) ; 

d  =  (1  — omega)  *d/sum(d)  +  omega*y.A2/sum(y/2); 

end 

x  —  1 ./ sqrt (d ) ; 


Figure  2.1:  Matlab  implementation  of  the  symmetric  stochastic  binormalization  algorithm  SSBIN. 


results  that  apply  in  the  limit  k  — ►  oo  are  not  necessarily  relevant.  In  the  end,  we  have  chosen  to 
motivate  our  algorithms  by  examining  a  deterministic  sequence  and  then  demonstrate  their  efficacy 
on  a  large  test  set,  to  the  results  of  which  we  now  turn. 


2.4  Numerical  experiments 

Framework 

Figure  2.1  shows  a  Matlab  implementation  of  SSBIN,  the  stochastic  binormalization  algorithm  for 
symmetric  matrices  corresponding  to  the  sequence  (2.25);  and  Figure  2.2  shows  an  implementation 
of  SNBIN,  the  stochastic  binormalization  algorithm  for  nonsymmetric  matrices  corresponding  to  the 
sequence  (2.27). 

We  test  our  algorithms  in  Matlab  on  problems  in  the  University  of  Florida  Sparse  Matrix 
Collection  [17].  We  use  the  following  Matlab  code  to  obtain  matrices  obeying  our  criteria: 

index  =  UFget  (’ refresh  ’) ; 

%  Symmetric 

sids  =  find  (  index  .  nnz  <=  le7  Sz  ~  index  .  isBinary  Sz . . . 

index  .  numericaLsymmetry  =  1  Sz . . . 

index,  sprank  =  index  .  nrows  Sz  index  .  isReal  ) ; 

%  Square  nonsymmetric 

nids  =  find  ( index  .  nnz  <=  le7  Sz  ~  index  .  isBinary  Sz  . . . 
index  .  nrows  =  index  .  ncols  Sz  . . . 
index  .  sprank  =  index  .  nrows  &  index  .  isReal  ) ; 

%  Rectangular 
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function  [x  y]  =  snbin  (A,  nmv,m,  n) 

%  Stochastic  matrix— fr  e  e  binormalization  for  nonsymmetric  real  A. 

%  [x  y ]  =  snbin  (A,  nmv ,m,n) 

%  A  is  a  matrix  or  function  handle.  If  it  is  a  function  handle,  then 
%  v  =  A(x)  returns  A*x  and  v  =  A(x,  ’ trans  ’)  returns  A’*x. 

%  nmv  is  the  number  of  matrix—  v ect or  products  to  perform. 

%  m,n  is  the  size  of  the  matrix.  It  is  necessary  to  specify  these  only  if 

%  A  is  a  function  handle. 

%  diag(x)  A  diag(y)  is  approximately  binormalized . 
op  =  isa(A,  ’  function_handle  ’  )  ; 
if(~op)  [m  n]  =  size(A);  end 
r  =  ones  (m,  1 )  ;  c  =  ones(n,l); 
for  ( k  =  1 : nmv) 

omega  =  2"  ( —  max(min(  floor  ( log2  ( k ) )  —  1  , 4)  ,  1 ) ) ; 

s  =  randn (n , 1 ) .  /  sqrt ( c )  ; 

if(op)  y  =  A(s);  else  y  =  A*s;  end 

r  =  (1  —  omega)  *  r  /sum(  r  )  +  omega*y .  "  2/sum(y  .  ~  2  ) ; 

s  =  randn (m,  1 )./ sqrt ( r ) ; 

if(op)  y  =  A(s  ,  ’trans’);  else  y  =  (s  ’*A)  ’ ;  end 
c  =  (1  —  omega)  *  c/sum(  c  )  +  omega*y .  "  2/sum(y  .  ~  2  ) ; 

end 

x  =  1 . / sqrt ( r )  ; 
y  =  1./ sqrt  (c  ) ; 

Figure  2.2:  Matlab  implementation  of  the  nonsymmetric  stochastic  binormalization  algorithm 

SNBIN. 


rids  =  find  (  index. nnz  <=  le7  &  ~  index  .  isBinary  &... 
index  .  nrows  ~=  index  .  ncols  &... 

index  .  sprank  =  min(  index  .  nrows  ,  index  .  ncols  )  &  index  .  isReal  ) ; 

The  predicate  index .  sprank  ==  index .  nrows  assures  that  the  matrices  have  full  structural  rank, 
the  condition  Theorems  3  and  5  give  for  a  square  matrix  to  be  e-scalable.  (But  recall  that  we  do 
not  have  a  theorem  to  support  rectangular  matrices.)  At  the  time  of  our  experiments,  only  three 
complex-valued  matrices  in  the  collection  have  full  structural  rank  and  are  symmetric,  and  so  we 
focus  on  only  real-valued  matrices. 

Our  primary  metric  for  assessing  the  performance  of  the  algorithms  is  the  reduction  in  condition 
number  of  the  matrix.  A  secondary  metric,  used  to  verify  that  our  algorithms  indeed  approximately 
equilibrate  matrices,  is  based  on  normalized  variance.  The  normalized  variance  of  the  vector  v  is 


nvar(u) 


-Mill  var(v) 


where  /i  is  the  mean  of  the  elements  of  v  and  var(u)  is  their  variance.  A  useful  property  is  that 
0  <  nvar(u)  <  1.  For  symmetric  matrices,  we  use  the  normalized  variance  of  the  row  2-norms, 
denoted  NVR  or  NVR(A).  For  nonsymmetric  matrices,  we  use  the  maximum  of  the  normalized 
variance  of  the  row  and  column  2- norms,  denoted  MVR  or  MVR(A). 

To  evaluate  the  condition  number  of  a  matrix  A ,  we  use  the  Matlab  function  scond  shown  in 
Figure  2.3.  This  function  uses  the  sparse  methods  eigs  and  svds  to  obtain  the  extremal  singular 
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function  c  =  scond  (A,  issym  ) 

%  cond  for  sparse  matrices  . 

%  c  =  scond  (A,  issym ) 

%  A  is  a  sparse  matrix. 

%  issym  is  a  boolean:  true  if  A  is  symmetric,  false  otherwise  . 
if(nargin  <  2)  issym  =  false;  end 
si  =  []  ;  sn  =  []  ; 

opts  =  struct  (  ’maxit  ’  ,le3,  ’  disp  ’  ,0,  ’issym  5  ,  issym  ,  ’isreal  ’  ,  true  ) ; 

[m  n]  =  size  (A) ; 
try 

if  (issym)  %  Symmetric 

si  =  abs  (  eigs  (A,  1  ,  ’LM’  ,  opts  ) )  ; 
sn  =  abs  (  eigs  (A,  1  ,  ’SM’  ,  opts  ) )  ; 
elseif(m=  n)  %  Nonsymmetric  square 
si  =  svds (A,  1 ,  ’L ’  , opts ) ; 
sn  =  svds (A,  1  , 0 , opts ) ; 
else  %  Rectangular 

si  =  svds (A,  1 ,  ’L ’  , opts ) ; 
if(m  <  n)  B  =  A*A’;  else  B  =  A’*A;  end 
opts  .  issym  =  true  ; 
sn  =  sqrt  (  eigs  (B,  1  ,  ’SM ’,  opts  ))  ; 
end 
catch 
end 

i  f  ( isempty  ( si )  ||  isempty(sn))  c  =  0;  else  c  =  sl/sn;  end 

Figure  2.3:  Matlab  function  to  compute  the  condition  number  of  the  sparse  matrix  A. 


values  of  a  matrix.  To  find  the  smallest  singular  value  in  the  thin  SVD  of  the  rectangular  matrix 
A,  scond  forms  the  product  B  =  AAT  or  AT A,  depending  on  the  shape  of  the  matrix,  and  then 
uses  eigs.  We  found  that  this  method  is  more  robust,  from  the  perspective  of  our  software  testing 
framework,  than  applying  svds  directly  to  A. 

SSBIN  and  SNBIN  are  useful  only  if  the  number  of  matrix-vector  products  is  significantly  smaller 
than  the  size  of  the  matrix.  We  assess  the  results  of  the  algorithms  when  they  are  allowed  K  =  [ pn\ 
iteration,  where  p  <  1  and  n  is  the  size  of  the  matrix.  If  the  matrix  is  rectangular,  n  is  the  larger  of 
the  number  of  rows  and  columns.  For  SSBIN,  one  iteration  corresponds  to  one  matrix-vector  product; 
for  SNBIN,  two,  one  with  the  matrix  and  one  with  its  transpose.  We  test  four  values  oi p:  0.5%,  1%, 
2%,  5%.  If  a  matrix  is  too  small,  then  K  is  too  small;  therefore,  we  consider  matrices  having  size 
n  >  200,  so  that  the  four  values  of  p  correspond  to  minimum  values  of  K  of  1,  2,  4,  10.  Of  course, 
in  practice,  one  likely  has  a  much  larger  matrix. 

Since  our  algorithms  are  stochastic,  their  performance  on  a  matrix  varies  with  each  application. 
We  run  the  stochastic  algorithms  five  times  for  each  matrix  and  accumulate  the  results. 

Lemmas  8  and  9  allow  for  a  variety  of  distributions  for  the  random  vector  involved  in  the  matrix- 
vector  product.  Chen  and  Demmel  [14]  use  the  random  vector  z  such  that  Zi  £  {—1, 1}  is  iid,  each 
value  1  and  —1  having  equal  probability.  Their  Lemma  6  shows  that  this  distribution  has  minimal 
variance  among  all  distributions  such  that  Ezi  =  0  and  E  zf  =  1.  They  connect  minimal  variance  to 
fast  convergence  in  estimates  based  on  2.  We  find  that  the  vector  u  such  that  Ui  ~  N( 0, 1)  produces 
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cond(X1/2  A  X1/2)  (N=227,  MVP=0.5%)  cond(X1/2  A  Xia)  (N=228,  MVP=1 .0%) 


cond(X1/2  A  X1/2)  (N=228,  MVP=2.0%) 


cond(X1/2  A  X1/2)  (N=227,  MVP=5.0%) 


(d) 


Figure  2.4:  The  performance  of  all  methods  for  symmetric  matrices.  The  title  of  each  figure  indicates 
the  number  of  matrix-vector  products  and  the  performance  metric.  MVP  =  p%  means  that  [fonJ 
matrix-vector  products  are  used  for  a  matrix  of  size  n.  In  these  plots,  the  performance  metric  is  the 
condition  number  of  the  scaled  matrix. 


slightly  better  results  for  our  algorithms,  and  we  always  use  this  distribution.  Other  algorithms  we 
discuss  in  this  section  use  2.  Two  sets  of  test  results  compare  u  and  z. 

For  convenience,  we  refer  to  the  algorithm  that  uses  the  identity  matrix  as  a  scaling  matrix  as 
algorithm  A.  This  algorithm  corresponds  to  doing  nothing. 

We  plot  results  in  three  primary  formats.  Figure  2.7  uses  all  three,  and  so  we  refer  to  this  figure 
in  the  following  descriptions. 

•  We  display  the  distribution  in  sizes  of  the  matrices  in  our  test  sets  by  a  plot  of  cumulative 
density.  Let  (n,  y )  be  a  point  on  the  blue  curve.  At  least  y%  of  the  test  matrices  have  size  at 
least  n.  See  Figure  2.7(a)  for  an  example. 

•  Let  the  performance  of  an  algorithm  on  a  particular  problem  be  characterized  by  a  single 
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cond(X1/2  A  X1/2)  (N=329,  MVP=1 .0%)  cond(X1/2  A  X1/2)  (N=328,  MVP=5.0%) 


(a) 


(b) 


Figure  2.5:  The  performance  of  all  stochastic  methods  for  symmetric  matrices. 


cond(X1/2  A  X1/2)  (N=108,  MVP=5.0%)  cond(X1/2  A  X1/2)  (N=119,  MVP=5.0%) 


Figure  2.6:  The  performance  of  all  methods  for  symmetric  matrices;  p  =  5%.  The  matrices  are  (a) 
positive  definite  and  (b)  indefinite. 


scalar  metric  x  such  that  a  smaller  \  indicates  better  performance.  Suppose  K  algorithms 
are  run  on  each  problem  in  a  test  set  of  size  N .  For  problem  n,  let  the  smallest  metric  value 
be  x™-  Dolan  and  More  [19]  introduced  performance  profiles  as  shown  in  Figure  2.7(b)  and, 
as  a  more  complex  example,  each  of  the  plots  in  Figure  2.5.  Each  curve  corresponds  to  an 
algorithm.  Consider  the  point  (x,  y )  on  curve  k.  The  interpretation  is  that  algorithm  k  yields 
metric  values  {Xn}n= l  that  are  within  a  factor  x  of  {xn}n= l  on  V  percent  of  problems.  The 
title  of  each  plot  gives  the  test  set  size  N,  and  the  legend  indicates  the  algorithms  considered. 

A  simple  rule  of  thumb  when  looking  at  performance  profiles  is  that  curves  corresponding  to 
better  algorithms  go  to  y  =  100%  faster. 

•  The  scatter  plot,  an  example  of  which  is  Figure  2.7(c),  compares  one  algorithm,  for  each  of  the 
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(a) 


NVR(X1/2  A  X1/2)  (N=340,  MVP=5.0%) 


(b)  (c) 


cond(X1/2  A  X1/2)  (N=340,  MVP=5.0%) 


Figure  2.7:  The  performance  of  SBIN  on  symmetric  matrices,  (a)  Matrix  sizes,  (b,  c)  The  perfor¬ 
mance  metric  is  the  normalized  variance  of  the  row  2-norms  (NVR).  (d,  e)  The  performance  metric 
is  the  condition  number. 
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cond(R1/2  A  C1/2)  (N=615,  MVP=0.5%)  cond(R1/2  A  C1/2)  (N=615,  MVP=1.0%) 


cond(R1/2  A  C1/2)  (N=620,  MVP=2.0%)  cond(R1,z  A  C1/2)  (N=613,  MVP=5.0%) 


(c) 


(d) 


Figure  2.8:  The  performance  of  all  methods  for  nonsymmetric  square  matrices. 


four  values  of  p,  against  algorithm  A.  A  point  (x,y)  having  the  /cth  color,  as  indicated  in  the 
legend,  is  interpreted  as  follows.  Algorithm  A  yields  the  metric  value  x,  while  the  algorithm 
yields  the  value  y.  The  red  line  is  for  convenience:  any  point  below  the  red  line  improves  upon 
algorithm  A.  The  points  corresponding  to  p  =  5%  are  bolder  than  the  others. 

When  constructing  a  plot  of  results,  we  discard  any  matrix  for  which  we  do  not  have  results 
from  every  algorithm  included  in  the  plot.  Matlab’s  eigs  and  svds  functions  can  consume  a  large 
amount  of  memory  on  some  large  matrices,  causing  the  Matlab  process  to  die.  In  other  cases,  the 
function  scond  obtains  an  erroneous  result  from  eigs  or  svds,  and  again  we  lose  a  measurement. 
Finally,  we  do  not  run  the  deterministic  algorithms  on  some  of  the  largest  problems  because  our 
Matlab  implementation  of  dbin  is  rather  slow.  It  would  be  be  better  to  implement  it  as  a  mex 
function.  Consequently,  we  often  plot  results  for  subsets  of  the  algorithms  on  a  larger  set  of  matrices 
in  addition  to  comparing  all  the  algorithms  on  a  smaller  set. 

We  also  discard  any  matrix  that  is  already  binormalized.  For  both  the  sets  of  symmetric  and 
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cond(R1/2  A  C1/2)  (N=641,  MVP=0.5%)  cond(R1,z  A  C1/2)  (N=645,  MVP=1.0%) 


cond(R1/2  A  C1/2)  (N=645,  MVP=2.0%) 


(c) 


cond(R1/2  A  C1/2)  (N=639,  MVP=5.0%) 


(d) 


Figure  2.9:  The  performance  of  all  stochastic  methods  for  nonsymmetric  square  matrices. 


nonsymmetric  square  matrices,  we  end  up  discarding  only  slightly  more  than  10  matrices  each  for 
this  reason.  But  quite  a  number  of  the  matrices  in  the  rectangular  set  are  already  binormalized. 


Symmetric  matrices 

We  test  other  algorithms,  both  deterministic  and  stochastic,  in  addition  to  ours.  In  the  case  of 
symmetric  matrices,  we  consider  the  following  additional  algorithms: 


•  NORM(u).  Compute  the  sample  mean  of  the  row  2-norms.  Use  u  rather  than  z.  norm(z) 
will  be  considered  in  the  experiments  involving  nonsymmetric  matrices.  Let  the  estimate  be 
r.  Scale  A  as  R~1/2AR~1/2. 
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Matrix  size  n 


(a) 


Figure  2.10:  The  performance  of  SNBIN  on  nonsymmetric  square  matrices. 


•  SDIAG.  The  simplest  of  Bekas,  Kokiopoulou,  and  Saad’s  [1]  stochastic  methods  to  estimate  the 
diagonal  of  a  matrix.  The  algorithm  uses  z  and  is  as  follows: 


N 
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The  algorithm  computes  N  matrix-vector  products  with  random  vectors  z1  and  then  computes 
a  sample  mean.  zl(AzL)  is  an  element-wise  product  between  zl  and  Az1 .  The  quantities 
ZjZj  =  1,  and  ZjZ\  £  {—1,1}  with  equal  probability,  and  so  the  diagonal  elements  of  A  sum 
while  the  off-diagonal  elements  tend  to  cancel  out.  Use  the  result  just  as  exact  Jacobi  scaling 
uses  the  diagonal.  The  method  is  not  intended  to  produce  a  preconditioner,  but  we  include 
it  in  our  numerical  experiments  because  the  diagonal  estimators  developed  in  [1]  are  the  only 
matrix-free  methods  of  which  we  know,  other  than  the  straightforward  NORM  methods,  that 
may  be  considered  to  produce  a  preconditioner. 

•  DBIN.  We  rename  Livne  and  Golub’s  [42]  method  SBIN  as  DBIN  for  clarity;  D  stands  for 
deterministic. 

•  DIAG.  Jacobi  scaling.  If  a  diagonal  element  is  0,  substitute  1. 

Each  iteration  of  each  of  these  algorithms  except  DIAG  performs  work  equivalent  to  one  matrix- vector 
product,  and  so  the  values  of  p  apply  equally  to  all  of  them.  DIAG  of  course  requires  very  little  work 
at  all. 

The  algorithms  are  related  to  each  other.  SSBIN  is  the  stochastic  matrix-free  algorithm  corre¬ 
sponding  to  the  deterministic  DBIN;  similarly,  SDIAG  corresponds  to  DIAG.  NORM(u)  is  like  SSBIN  in 
the  sense  that  each  iteration  involves  estimating  the  2-norm  of  the  rows  of  a  matrix. 

Figure  2.4  shows  performance  profiles  for  all  four  values  of  p  using  condition  number  as  the 
metric.  As  expected,  DBIN  does  best  on  average.  But  SSBIN  gets  close  as  p  increases.  Similarly, 
SDiAG’s  performance  approaches  that  of  DIAG  as  p  increases,  as  one  expects.  All  the  algorithms  do 
better  than  A  on  average. 

Whether  a  matrix  is  positive  definite  is  important.  Figure  2.6  shows  results  for  p  =  5%,  grouping 
the  results  into  those  for  positive  definite  and  indefinite  matrices.  Based  on  the  analysis  in  Section 
2.2.3,  we  expect  Jacobi  scaling  and  binormalization  to  have  similar  performance  on  positive  definite 
matrices,  and  indeed  this  expectation  is  quite  clearly  borne  out.  In  contrast,  on  indefinite  matrices, 
the  binormalization  algorithms  perform  significantly  better  than  the  Jacobi  scaling  algorithms. 

We  recall  again  that  Bekas,  Kokiopoulou,  and  Saad  [1]  had  no  intention  of  devising  a  method 
to  compute  a  diagonal  preconditioner  using  only  matrix-vector  products.  Moreover,  they  developed 
a  more  complicated  algorithm  that  we  did  not  implement.  It  uses  a  sequence  of  vectors  taken 
from  Hadamard  matrices  and  is  particularly  effective  on  banded  matrices.  Hence  any  criticism 
of  SDIAG  in  this  context  is  quite  unwarranted.  We  note  simply  as  a  sanity  check  that  even  on 
positive  definite  matrices,  SSBIN  significantly  outperforms  SDIAG.  On  indefinite  matrices,  a  diagonal 
estimator,  however  effective,  is  ultimately  limited  by  the  poor  performance  of  Jacobi  scaling. 

Next,  we  show  comprehensive  results  for  SSBIN.  Figure  2.7(a)  shows  the  sizes  of  the  test  matrices. 
To  confirm  that  SSBIN  equilibrates  matrices,  Figures  2.7(b,c)  show  performance  profiles  and  a  scatter 
plot  using  normalized  variance  of  the  row  2-norms  as  the  metric.  As  expected,  the  NVR  is  reduced 
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significantly  on  average.  Finally,  Figures  2.7(d,e)  show  the  same  types  of  plots  but  once  more  using 
condition  number  as  the  metric. 

The  dots  in  Figure  2.7(c)  have  a  horizontal  trend  at  slightly  above  0.1.  If  the  parameter  u>  in 
SSBIN  were  allowed  to  go  to  0  in  a  controlled  way  that  obeyed  certain  conditions,  then  the  theory 
of  stochastic  approximation  suggests  the  normalized  variance  should  go  to  0.  As  we  are  interested 
in  finite-iteration — indeed,  small-iteration — results,  we  keep  u>  fixed  after  decreasing  it  for  a  certain 
number  of  iterations.  We  believe  the  horizontal  scatter  of  dots  corresponds  to  the  lower  limit  of  the 
normalized  variance  that  can  be  achieved  given  the  final  fixed  value  of  i o.  One  might  decrease  uj 
beyond  our  final  value.  Our  experience  is  that  if  u>  becomes  small  too  quickly,  more  iterations  may 
be  required.  As  robustness  is  more  important  than  exact  equilibration,  we  have  chosen  the  final 
value  of  u>  accordingly.  Still,  a  careful  parameter  study  of  to  for  certain  classes  of  matrices  appearing 
in  a  particular  application  may  reveal  a  better  sequence  {w^}. 

Nonsymmetric  square  matrices 

Our  original  motivation  for  this  work  was  equilibrating  symmetric  matrices,  but  the  existence  and 
uniqueness  theory  in  Section  2.2  supports  both  symmetric  and  nonsymmetric  matrices,  and  SSBIN 
extends  immediately  to  SNBIN.  Hence  we  present  results  for  the  latter. 

In  addition  to  SNBIN,  we  test  the  following  other  algorithms: 

•  ROW  NORM(u).  Estimate  the  row  norm  of  the  matrix  using  u.  Let  r  be  the  row  norms.  The 
scaled  matrix  is  i?1/2  A. 

•  ROW  NORM(z).  Estimate  the  row  norm  of  the  matrix  using  z. 

•  SK.  The  Sinkhorn-Knopp  iteration. 

•  ROW  NORM.  Exact  row  norm. 

The  two  stochastic  ROW  NORM  methods  require  only  a  forward  matrix-vector  product  per  iteration. 
Since  they  are  being  compared  with  SNBIN,  which  requires  both  a  forward  and  a  transpose  matrix- 
vector  product,  they  are  permitted  two  forward  products  for  every  iteration  of  SNBIN. 

Figure  2.8  shows  results  for  all  the  methods,  and  Figure  2.9  for  just  the  stochastic  methods. 
The  random  vector  u  appears  to  perform  slightly  better  than  2  in  the  ROW  NORM  methods  on  the 
matrices  in  this  set  when  the  number  of  matrix- vector  products  is  small.  Forp  >  1%,  all  the  methods 
do  better  on  average  than  A. 

The  most  important  observation  comes  from  Figure  2.9:  as  the  stochastic  ROW  NORM  methods 
require  the  same  work  as  SNBIN,  and  as  SNBIN  outperform  those  methods  on  average,  one  might 
prefer  the  more  complicated  SNBIN  to  the  simple  ROW  NORM  methods. 

Interestingly,  the  Sinkhorn-Knopp  iteration  does  not  do  particularly  well  on  these  test  problems 
under  the  restrictions  on  the  number  of  matrix-vector  products  we  impose.  Indeed,  Figures  2.8(c,d) 
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show  that  the  iteration’s  performance  is  somewhat  worse  than  SNBlN’s.  We  believe  the  disparity 
in  performance  relates  to  an  observation  about  the  Sinkhorn-Knopp  iteration  we  made  early  in  our 
investigations:  it  takes  a  while  for  the  iterates  to  settle  down.  One  can  regularize  the  Sinkhorn-Knopp 
iteration  in  several  ways,  and  we  have  found — although  we  do  not  present  the  experiments,  as  they 
are  just  some  of  many  we  performed  that  are  outside  the  scope  of  this  chapter — that  certain  methods 
of  regularization  can  improve  the  behavior  of  the  Sinkhorn-Knopp  iteration  in  the  early  iterations, 
although  at  the  cost  of  decreased  convergence  speed  in  later  iterations.  SNBIN  effectively  implements 
one  form  of  regularization  simply  by  averaging  the  current  iterate  with  a  stochastic  estimate  of  the 
next  Sinkhorn-Knopp  iterate.  In  any  case,  regularizing  the  Sinkhorn-Knopp  iteration  when  one  has 
access  to  the  elements  of  the  matrix  is  almost  certainly  not  the  right  thing  to  do;  rather,  one  should 
implement  one  of  the  algorithms  described  in,  for  example,  [36]. 

Rectangular  matrices 

Again,  our  primary  concern  was  symmetric  matrices,  but  SNBIN  works  on  both  square  and  rectan¬ 
gular  matrices.  Recall,  however,  that  we  have  not  developed  a  theory  of  e-scaling  for  rectangular 
matrices. 

In  our  tests,  a  matrix  is  transposed,  if  necessary,  so  that  it  has  more  rows  than  columns.  In 
addition  to  the  other  stochastic  algorithms,  we  also  test  COLUMN  NORM  (u);  it  is  implemented  just 
as  ROW  NORM  (u)  but  for  the  columns. 

Figure  2.11  shows  our  results.  Figure  2.11(b)  compares  using  u  and  z  vectors  for  the  ROW  NORM 
algorithm.  In  contrast  to  the  results  on  the  test  set  of  square  nonsymmetric  matrices,  the  algorithm 
using  z  slightly  outperforms  that  using  u. 

Fixed  number  of  matrix-vector  products 

In  the  previous  experiments,  we  have  set  the  number  of  matrix-vector  products  proportional  to  the 
size  of  the  problem.  Figure  2.12  shows  results  of  experiments  in  which  the  number  of  matrix-vector 
products  is  fixed  for  all  problems. 


2.5  Summary  and  conclusions 

Scaling  a  matrix  usually  reduces  its  condition  number.  Perhaps  the  simplest  scaling  method  for 
symmetric  matrices  is  Jacobi  scaling.  But  this  method  is  not  effective  for  many  indefinite  matrices. 
Other  simple  scaling  methods  are  based  on  using  the  row  or  column  norms. 

Equilibration  in  a  p-norm  can  be  an  effective  alternative.  Using  a  combination  of  theory  and 
numerical  experiments,  we  showed  that  symmetric  binormalization  and  Jacobi  scaling  are  about 
equally  effective  on  symmetric  positive  definite  matrices.  But  symmetric  binormalization  is  quite  a 
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cond(R1/2  A  C1/2)  (N=263,  MVP=5.0%)  cond(R1/2  A  C1/2)  (N=258,  MVP=0.5%) 


(a) 


(b) 


MVR(R1/2  A  C1'2)  (N=264,  MVP=5.0%) 


(c) 


(d) 


cond(R1/2  A  C1/2)  (N=264,  MVP=5.0%) 


(e) 


(f) 


Figure  2.11:  The  performance  of  SNBIN  on  rectangular  matrices 
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cond(X  A  X'“)  (N=317,  MVP=40) 


(a) 


cond(X  A  X'l£)  (N=820,  MVP=40) 


Figure  2.12:  The  performance  of  the  stochastic  methods  when  the  number  of  matrix-vector  products 
is  fixed  for  all  matrices,  (a,  c)  Performance  profiles  for  all  stochastic  methods  on  symmetric  and 
square  nonsymmetric  matrices,  respectively,  when  40  matrix-vector  products  are  used,  (b,  d)  The 
performance  of  SSBIN  and  SNBIN,  respectively,  for  10,  20,  40,  and  80  matrix-vector  products. 


bit  better  than  Jacobi  scaling  on  indefinite  matrices.  For  nonsymmetric  matrices,  it  is  also  better, 
on  average,  than  simply  scaling  to  unit  row  or  column  norms. 

Approximate  scaling  is  a  practical  alternative  to  exact  scaling,  and  we  wanted  to  develop  a  theory 
characterizing  it.  We  reviewed  the  theory  governing  exact  scaling  and  developed  the  theory  of  e- 
scaling.  It  was  already  known  that  a  (symmetric)  square  matrix  is  (symmetrically)  scalable  if  and 
only  if  it  has  total  support  (part  1  of  Theorem  1  and  Theorem  2),  and  it  is  diagonally  equivalent  to 
a  unique  doubly  stochastic  matrix  (part  2  of  Theorem  1).  We  discovered  that  a  (symmetric)  square 
matrix  is  (symmetrically)  e-scalable  if  and  only  if  it  has  support  (Theorems  5  and  3) ,  and  the  doubly 
stochastic  matrix  C  obtained  in  the  limit  e  — »  0  is  unique  (Theorem  6). 

While  the  class  of  scalable  matrices  excludes  some  numerically  nonsingular  matrices,  the  class 
of  £-scalable  matrices  is  precisely  the  class  of  structurally  nonsingular  matrices  and  so  includes  all 
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numerically  nonsingular  matrices.  This  observation  suggests  that  e-scaling  is  a  useful  formalization 
of  approximate  scaling. 

Previously  developed  scaling  methods  require  access  to  the  elements  of  a  matrix.  We  developed 
the  matrix-free  stochastic  binormalization  methods  SSBIN  and  SNBIN  to  perform  approximate  sym¬ 
metric  and  nonsymmetric  scaling.  The  performance  of  these  algorithms  on  a  large  test  set  suggests 
that  stochastic  binormalization  scales  a  matrix  using  a  number  of  matrix-vector  products  that  is 
small  relative  to  the  size  of  the  matrix  and  so  is  a  practically  useful  method  when  direct  access  to 
the  elements  of  a  matrix  is  not  available. 


Chapter  3 


A  limited-memory  quasi-Newton 
method 


In  this  chapter,  we  develop  a  limited-memory  (LM)  quasi-Newton  (QN)  method.  It  has  two  main 
features:  it  uses  a  circular  buffer  to  store  QN  pairs,  and  it  updates  an  initial  matrix  Bg  each  time 
it  updates  the  buffer.  We  develop  an  update  that  sets  B0  to  a  positive  diagonal  matrix  D  such  that 
Z)-1/2  approximately  binormalizes  the  Hessian. 

First,  we  provide  the  foundation  for  the  update.  Second,  we  describe  the  diagonal  update  and 
the  rest  of  the  algorithm.  Third,  we  discuss  the  implementation  of  the  algorithm  in  the  current 
version  of  SNOPT  [25]  and  a  future  version  that  will  have  a  new  quadratic  program  solver.  Finally, 
we  describe  the  performance  of  the  modified  version  of  SNOPT  on  several  test  problem  sets. 

3 . 1  Background 

3.1.1  Notation 

Let  n  be  the  size  of  the  problem.  The  difference  between  two  iterates  Xk+i  and  Xk  is  Sk  =  Xk+i  —  Xk, 
and  the  difference  in  the  gradients,  denoted  g ,  of  the  Lagrangian  at  the  two  iterates  is  yk  =  gk+i~9k- 
If  the  particular  iteration  update  index  k  is  irrelevant,  the  notation  s  =  x+  —  x  may  be  used.  A  pair 
is  the  tuple  {s*,,  ?/*,}.  A  buffer  of  pairs  is  a  finite  sequence  of  pairs  and  is  denoted  {sj,yj}jLkl  for 
fci  <  fc2.  Let  np  be  the  number  of  pairs  the  pair  buffer  can  hold.  B  =  bfgs({sj,  yj}j=kl ,  B0)  denotes 
the  limited- memory  BFGS  Hessian  approximation,  where  Bq  is  the  initial  matrix,  {.sy ,  yj  is 
the  pair  buffer,  and  fc2  —  fci  +  1  <  np.  In  some  cases,  the  second  argument  is  a  vector,  say  do ,  and 
B0  =  diag(do)- 

The  key  property  of  quasi-Newton  methods  is  the  quasi-Newton,  or  secant ,  condition,  Bk+iSk  = 
yk-  We  omit  a  review  of  quasi-Newton  methods,  as  it  is  far  better  to  read  those  in,  for  example,  [26] 
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and  [51]. 


3.1.2  Linear  invariance  of  Broyden-class  updates  and  diagonal  scaling 


As  a  first  step,  we  examine  the  diagonal  matrix  Bq  in  the  context  of  Broyden-class  updates. 


Linear  invariance 


Let  /(x)  =  f(x),  where  x  =  Cx  and  C  is  a  nonsingular  matrix.  Then 

g(x)  =  Vx/(x)  =  CTg[x)  (3.1) 

H(x)  =  V2J(x)  =  CtH(x)C. 


Corresponding  to  s  and  y  are  the  transformed  vectors  s  =  Cs  and  y  =  C  Ty.  Let  B  be  a  quasi- 
Newton  approximation  to  H  that  is  updated  according  to  a  formula  in  the  Broyden  class: 


Vk 


Bk+i 


Vk _ BkSk 

VkSk  S^BkSk 
Bksksk Bk 

71  r  ) 


+  4>k{slBksk)vkvl , 


(3.2) 


where  (f>k  is  a  scalar. 


Theorem  11.  If  Bq  =  CT BqC ,  and  .s,;  =  Csi  and  yi  —  C  Tyi  hold  for  all  Si,  yi,  Si,  yi,  then 
Bk+i  =  CTBk+1C. 


This  property  of  the  update  is  widely  known,  although  we  do  not  have  a  specific  reference  for  exactly 
the  form  we  use  here.  The  proof  is  by  induction. 


Proof.  Suppose  Bk  =  CT BkC .  Then  we  have  the  following  equalities: 

BkSkSTkBk  =  (CTSfcC')(C-1sfe)(C-1sfc)r(C'T5feC)  =  CT(BkskslBk)C 
skBksk  =  (C~1sk)T  (CT  BkC){C~l  sk)  =  skBksk 
Vkvl  =  (CTyk)(CTyk)T  =  CT(ykyk)C 
SkVk  =  (C'"1sfe)T(C'Tyfe)  =  slyk 
Bkskyl  =  {CTBkC){C-1sk){CTyk)T  =  CT{Bkskyl)C. 
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Hence 


Bk+i  =  Bk  — 


=  C 1 


BkSkSk  Bk 
sT.Bksk 


VkVl 

SkVk 


t{sTkBkSk)vkvl 


-  _  BkSkslBk  ykyl 

skBksk  skyk 


4>(slBkSk)vkvj,  )C 


=  C 1  Bk+1C. 


□ 


Bk  is  a  preconditioner  for  a  steepest-descent  method 

A  quasi-Newton  matrix  can  be  thought  of  as  a  preconditioner  for  the  problem.  Let  the  QN  matrix 
Bk  =  CTC.  Consider  again  the  function  /( x).  The  quasi-Newton  method  gives  the  search  direction 

V  =  -B-\g(x)  =  - C-lC~Tg{x )  =  -C^x). 

With  step  size  a ,  the  new  iterate  is 

x  —  aBk1g{x)  =  C_1x  —  aC^1g(x)  =  C-1(x  —  ag(x)); 


hence  the  QN  step  is  the  steepest-descent  step  for  the  transformed  problem  in  the  transformed  space. 
We  think  of  the  transformed  problem  f(x)  as  the  preconditioned  problem. 

That  Bk  is  a  preconditioner  has  a  recursive  aspect  to  it.  Again,  let  Bk  =  CTC.  By  Theorem  11, 


Bk+ i 


BkSkSj.  Bk 
T  u 

sk  Bksk 


=  CT 


-  -T 

Sksi 

-T- 

s{  Sk 


+ 


+  +  <p(sk  Bksk)vkvk 

skVk 

+  0(skSk)vkVk')  C, 

skVk  > 


where  sk  =  Csk  and  yk  =  C  T yk\  hence  the  Broyden-class  update  may  be  thought  of  as  an  update 
of  the  identity  matrix  for  a  problem  preconditioned  by  Bk  =  CTC. 


Scaling 

In  the  special  case  that  B0  =  D  is  a  positive  diagonal  matrix,  D-1/2  is  a  scaling  matrix  that  scales 
the  original  problem.  If  we  view  Bq  as  a  preconditioner,  then  an  effective  procedure  is  to  set  B$  =  H, 
where  scaling  the  Hessian  by  D minimizes  its  condition  number  over  all  positive  scaling  matrices. 

In  general  the  Hessian  is  a  function  of  the  iterate  and,  moreover,  we  do  not  have  access  to  it 
either  by  element  or  by  matrix-vector  products.  This  chapter  describes  a  method  that  approximately 
binormalizes  the  Hessian  based  on  the  sequence  of  pairs  { s y ,  x/y  } .  The  diagonal  Bq  is  updated  every 
time  the  QN  matrix  as  a  whole  is  updated,  and  so  the  diagonal  scaling  matrix  changes  with  the 
iterate. 
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3.1.3  SNOPT 

SNOPT’s  current  limited-memory  quasi-Newton  method  is  as  follows  [25].  Let  r  be  a  major  iteration 
when  the  pair  buffer  is  empty.  Bq  is  a  positive  diagonal  matrix  that  initializes  the  approximation  to 
the  Hessian.  For  r  <  k  <  r  +  np ,  the  approximation  B k  is  updated  as  follows: 


bj  —  BjSj 
=  ivjsj)-1 

fa  =  (qjsjfa1 

k- 1 

Bk  =  B0  +  ^(djVjyj  -  faqjqj)-  (3.3) 

j—r 

This  form  of  the  update  is  called  the  summation  form.  To  assure  positive  definiteness  in  the  presence 
of  numerical  errors,  the  update  is  rewritten  in  product  form  as  Bk  =  G^Gk,  where 

Vj  =  ±(9j(j)j)  I  yj  —  4>jqj  (3-4) 

Gk  =  B10/2l[(I  +  sjvf). 

j=r 

The  sign  in  Vj  is  chosen  to  minimize  cancellation  error  when  forming  the  vector. 

At  update  index  k  =  r  +  np  + 1,  both  the  full  approximation  and  Bq  are  reset  to  the  diagonal  of 
Br+np+ 1  ( i.e the  diagonal  of  the  approximation  that  includes  the  new  pair  {sj,  Vj}j=r+np+i)i  and 
all  pairs  { Sj ,  Vj  }  (including  the  one  for  j  =  r  +  np  + 1 )  are  flushed  from  the  pair  buffer.  Additionally, 
the  diagonal  matrix  is  multiplied  by  a  scalar  to  calibrate  the  unit  step.  The  implementation  of 
the  diagonal  matrix  Bq  in  the  context  of  this  scheme  is  quite  efficient:  two  vectors  bo  and  bo  store 
diagonal  information,  bo  is  static  during  the  accumulation  of  pairs  in  the  buffer;  meanwhile,  bo  is 
efficiently  updated  at  each  major  iteration.  When  the  buffer  is  flushed,  bo  is  set  to  bo- 

A  feature  of  SNOPT’s  current  method  -indeed,  a  primary  motivation  for  its  form — is  that  the 
factorization  of  the  reduced  Hessian  can  be  updated  directly  when  SNOPT  solves  a  linearly  con¬ 
strained  problem. 

3.1.4  Our  work 

For  nonlinearly  constrained  problems  and  linearly  constrained  problems  for  which  evaluating  the 
objective  is  expensive,  SNOPT’s  method  may  be  improved  in  two  ways.  First,  rather  than  discarding 
all  pairs  at  certain  iterations,  we  could  maintain  a  circular  buffer  of  pairs.  Second,  rather  than 
updating  the  initial  diagonal  Bo  only  at  certain  iterations,  it  could  be  updated  at  every  iteration  if 
information  is  available  to  support  such  frequent  updates.  This  chapter  develops  a  method  having 
these  two  properties. 
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3.1.5  Related  work 

Possibly  the  earliest  work  on  a  limited-memory  method  that  does  not  enforce  a  sparsity  pattern 
is  that  of  Nocedal  [50].  He  developed  the  method  L-BFGS,  which  maintains  a  circular  buffer  of  a 
limited  number  of  pairs.  The  method  forms  the  foundation  for  many  modern  variations. 

The  simplest  updating  method  for  B o  is  to  set  Bq  =  Ofc/,  where  <Jk  is  updated  at  each  update  of 
the  QN  matrix.  Liu  and  Nocedal  [40] ,  among  others,  set  ct*,  =  y ^ yk / s ^ yk  ■ 

Recent  work  on  quasi-Newton  methods  includes  that  of  Kolda,  O’Leary,  and  Nazareth  [37] 
and  Gill  and  Leonard  [23,  24].  Kolda  and  her  coworkers  studied  limited- memory  methods  using 
Broyden-class  updates.  They  showed  that  only  BFGS  among  the  Broyden-class  updates  maintains 
the  property  of  finite  termination  on  unconstrained  convex  quadratic  problems  when  modified  to  be 
a  limited-memory  method.  They  also  tested  several  variants  of  full-  and  limited-memory  methods, 
studying  update  skipping  and  various  rules  for  discarding  pairs  from  the  pair  buffer. 

Gill  and  Leonard  [23]  developed  a  quasi-Newton  method  that  pays  particular  attention  to  the 
subspace  spanned  by  the  gradients  or,  equivalently,  the  search  directions.  The  method  optionally 
can  limit  the  subspace  from  which  the  next  search  direction  is  obtained.  In  the  case  of  a  full-memory 
method,  their  factorization  of  the  BFGS  matrix  permits  the  update  of  Bo  =  <JkI ;  the  authors  note 
that  this  can  be  quite  advantageous  in  the  full-memory  case,  as  a  poor  static  matrix  Bo  =  al  can 
slow  convergence.  They  developed  a  limited-memory  variant  [24]  of  the  method  that  uses  half  the 
storage  of  standard  limited-memory  methods,  including  the  one  we  use  in  this  chapter. 

Several  authors  have  studied  methods  using  diagonal  initial  matrices  B0.  SNOPT’s  current 
method  is  of  course  one  such  method.  Nazareth  [48]  and  Zhu,  Nazareth,  and  Wolkowicz  [66]  have 
developed  quasi-Newton-like  methods — they  call  them  quasi-Cauchy  methods — based  on  the  quasi- 
Cauchy,  rather  than  quasi-Newton,  condition  sJ^Dk+iSk  =  s^yk,  where  Dk+ 1  is  a  diagonal  matrix. 
The  condition  multiplies  each  side  of  the  quasi-Newton  condition  by  .sj.  These  methods  are  not 
themselves  relevant  to  our  work;  however,  in  the  conclusion  to  [66],  the  authors  remark  that  the 
diagonal  matrix  Dk+i  was  incorporated  into  L-BFGS  in  the  thesis  work  of  Zhu.  We  shall  remark 
further  on  this  in  a  moment. 

The  work  closest  to  ours  is  that  of  Gilbert  and  Lemarechal  [22]  and  the  follow-up  work  of  Veerse 
and  Auroux  [64]  and  Veerse,  Auroux,  and  Fisher  [65].  These  authors  tested  several  diagonal  updates 
with  the  L-BGFS  method  on  a  set  of  unconstrained  problems.  Their  best  diagonal  update  is,  with 
the  exception  of  implementation  details  and  a  calibration  of  the  unit  step  length,  the  same  as  the 
one  we  develop.  However,  they  do  not  attempt  to  understand  why  the  update  is  effective,  nor  do 
they  extend  the  implementation  to  constrained  problems. 

Interestingly,  the  authors  of  [65]  assessed  incorporating  the  quasi-Cauchy  method  of  [66]  into 
L-BFGS,  as  the  latter  authors  had  suggested  in  their  concluding  section,  and  found  the  results  were 
relatively  poor.  Prior  to  discovering  the  work  of  Veerse  and  coworkers,  we  had  implemented  this 
combined  method  and  could  not  obtain  good  results.  Our  view  is  that  the  quasi-Cauchy  method, 
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restricted  as  it  originally  was  simply  to  a  diagonal  matrix,  may  be  quite  effective.  But  the  interpre¬ 
tation  of  a  diagonal  Bq  as  a  matrix  that  attempts  to  capture  curvature  information  is  incompatible 
with  subsequently  updating  this  matrix  to  a  full  one:  as  the  discussion  in  Section  3.1.2  shows,  in 
the  context  of  a  Broyden-class  method,  a  diagonal  Bq  really  ought  to  be  viewed  as  a  scaling  matrix. 

Morales  [45]  conducted  a  numerical  study  comparing  L-BFGS  [50]  with  SNOPT  on  a  set  of 
unconstrained  problems.  Of  course  SNOPT’s  reduced-Hessian  method  is  computationally  unsuited 
to  large  unconstrained  problems,  but  Morales  correctly  used  the  number  of  calls  to  the  user  function 
to  assess  the  two  methods.  He  found  that  L-BFGS  generally  outperformed  SNOPT. 

SNOPT’s  current  method  cannot  be  directly  modified  to  accommodate  a  circular  buffer  and 
continual  diagonal  update.  SNOPT  stores  the  pairs  {sj,  Vj}.  If  B0  changes,  then  so  does  each  Vj,  for 
Vj  depends  on  qj  =  Bj-iSj,  and  Bj-i  changes  with  Bo-  Furthermore,  the  efficient  implementation 
using  the  vectors  bo  and  bo  is  no  longer  possible,  for  in  general  one  pair  is  removed  from  the  circular 
buffer  at  each  iteration.  If  we  ignore  matters  of  implementation,  we  can  generalize  SNOPT’s  update 
to  the  diagonal  matrix  by  setting  it  to  the  diagonal  of  the  Hessian  approximation  at  the  previous 
iteration.  Let  Bk  =  bfgs {{sj,yj}kjZ.\_n  ,dk ),  where  Bq  =  diag (dk)  is  the  current  initial  diagonal 
matrix.  At  the  next  iteration,  dk+i  is  set  to  diag(Hfe),  the  pair  {sj,yj}j=k-np  is  removed  from  the 
buffer,  the  pair  {sk,yk}  is  added,  and  Bk+ 1  =  bfgs({sj,  yj}j=k-np+i’  dk+i)- 

One  might  imagine  other  diagonal  updates  as  well.  Our  objective  in  this  chapter  is  to  develop 
a  method  whose  diagonal  update  has  a  motivating  interpretation,  and  to  implement  the  method 
efficiently  in  the  context  of  a  limited-memory  method  that  uses  a  circular  buffer. 


3.2  The  diagonal  update 

This  section  discusses  the  diagonal  update  for  our  limited-memory  quasi-Newton  Hessian  approxi¬ 
mation. 

Let  A  be  a  symmetric  positive  definite  matrix.  Let  A  =  D~x!2  AD~X!2  be  binormalized;  in 
particular,  let  Bx  =  d  (recall  B  =  A  o  A  and  X  =  D~x]  we  use  B  to  denote  both  the  element-wise 
square  of  A  and  the  QN  matrix,  but  the  meaning  should  be  clear  from  the  context).  Consider  the 
problem 

min  f{x)  (3.5) 

X 

for 

f(x)  =  -xT  Ax  =  -xT  D1/2  AD1/2  x  =  -xTAx, 

where  x  =  Dxl2x.  We  refer  to  quantities  having  a  tilde  as  scaled  quantities.  The  gradient  is 
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g(x)  =  yxf{ x)  =  Ax  and  the  scaled  gradient  is  g(x)  =  Vj/(x)  =  Ax,  and  so 


g(x)  =  D1/2g{x). 


Let  the  approximate  Hessian  used  in  the  iteration  be  B.  The  search  direction  is  p  =  —B  1g ,  the 
step  is 

s  =  ap  =  —uB~1g  =  —aB~1D1^2g, 
and  the  change  in  gradient  is 

y  =  As  =  —aAB~xg  =  —aAB~1D1/2g, 
where  a  is  the  step  size. 

3.2.1  The  algorithm  design  principles 

The  motivation  for  our  diagonal  update  is  the  stochastic  symmetric  binormalization  algorithm  of 
Chapter  2.  In  an  unconstrained  quadratic  program,  y  is  related  to  s  by  a  matrix-vector  product. 
We  explore  the  hypothesis  that  expressions  involving  s  and  y  provide  information  about  the  scaling 
of  the  problem,  much  as  the  stochastic  binormalization  algorithm  uses  matrix-vector  products  with 
the  random  vector  u. 

Let  us  remark,  however,  that  the  discussion  in  this  section  is  not  a  proof  of  any  sort.  Iterations  in 
optimization  algorithms  are  quite  complicated,  and  it  would  be  difficult,  if  not  impossible,  to  formu¬ 
late  and  prove  a  meaningful  theorem  about  the  diagonal  update  in  the  context  of  such  complexity. 
The  update  we  describe  turns  out  to  be  useful,  and  our  discussion  attempts  to  explain  why. 

Our  first  task  is  to  find  an  analog  in  the  current  setting  to  the  random  vector  u  in  the  stochastic 
binormalization  algorithm.  It  is  likely  that  a  scaled  vector  is  better  than  an  unsealed  one.  x  is 
not  the  right  one,  as  the  behavior  of  it  depends  on  the  origin,  s  =  Dx^2s,  y  =  As,  and  g  are  all 
independent  of  the  origin.  We  shall  ultimately  use  g,  but  we  also  consider  y. 

Consequently,  our  first  design  principle  is  that  g/||<?||  behaves  like  a  random  vector  sampled 
independently  from  the  uniform  distribution  on  the  sphere. 

Next,  the  Hessian  approximation  B,  if  too  complicated,  can  make  analysis  difficult.  In  designing 
our  diagonal  update,  we  assume  B  =  D,  where  D  is  our  diagonal;  that  is,  we  assume  there  is  no 
other  quasi-Newton  update  to  B. 

Finally,  we  use  an  unconstrained  convex  quadratic  problem  as  our  model.  Constraints  impose 
restrictions  on  the  space  from  which  g  is  drawn.  However,  just  as  in  our  stochastic  binormalization 
algorithm,  our  interest  is  in  what  happens  during  only  relatively  few  iterations.  If  the  constraints 
are  not  too  restrictive,  their  presence  may  not  be  obvious  in  the  sequence  of  g  vectors  until  after 
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many  iterations  have  occurred. 

The  motivation  for  these  principles  is  that  many  probability  density  functions  may  have  the 
property  stated  in  (2.17) — not  necessarily  just  those  pdfs  considered  in  Lemmas  8  and  9 — or  a 
similarly  useful  one,  and  there  may  be  many  processes  that  update  B  along  with  the  diagonal 
update  that  do  not  significantly  affect  the  behavior  of  the  diagonal  update.  But  these  may  be 
analytically  difficult  or  intractable  to  study.  What  we  need — and  what  we  believe  we  achieve  with 
our  design  principles — is  an  analytical  setting  simple  enough  to  admit  theoretical  exposition  but  rich 
enough  that  we  can  explain,  at  least  partially,  why  our  algorithm  works  well  in  practice. 

3.2.2  Some  useful  results 

Estimators  of  the  trace 

Estimators  similar  to  those  in  Lemmas  8  and  9  exist  for  the  trace  of  a  matrix.  Suppose  the  matrix 
A  and  the  random  vector  u  are  independent. 

Lemma  13.  If  the  elements  of  u  G  Rn  have  zero  mean,  positive  and  finite  variance,  and  are  iid, 
then  Eut  Au  =  r]  E  trace  A  for  finite  r)  >  0. 

Proof.  The  cross  terms  E  UiUj  =  0  if  i  ^  j,  and  so 


where  i)=E«(  >  0. 


□ 


Lemma  14.  If  the  elements  of  u  G  M"  have  positive  and  finite  variance  and  are  iid  with  pdf  f,  and 
f  is  symmetric  around  zero,  then 


_ u  Au  i  „ 

E — —  =  —  E  trace  A. 

ii -L  ii  n 


u1  u  n 


Proof.  A  cross  term  has  the  form  (2.18),  and  so  we  are  left  with 


where  n  =  E  1  „  =  n 


□ 


Construction  of  random  vectors  and  matrices 


In  the  absence  of  analytical  results,  we  can  explore  properties  of  an  algorithm  by  running  it  on 
randomly  generated  data.  To  obtain  generic  results,  it  is  important  to  sample  data  from  a  meaningful 
distribution. 


3.2.  THE  DIAGONAL  UPDATE 


53 


We  use  random  spd  matrices  to  test  some  of  the  ideas  in  this  chapter.  How  can  we  assure  that 
our  experiments  capture  generic  properties  of  the  algorithm?  For  example,  does  sparsity  matter? 
Scaling? 

Symmetric  positive  definite  matrices  are  entirely  characterized  by  an  orthonormal  matrix  and 
a  positive  diagonal  matrix:  A  =  QAQT .  A  sparse  matrix  A  becomes  dense  upon  rotation  by  a 
generically  chosen  orthonormal  matrix.  Hence  if  an  algorithm  is  invariant  to  a  transformation  by 
an  orthonormal  matrix,  then  sparsity  does  not  affect  the  mathematical  behavior  of  the  algorithm, 
although  it  may  affect  storage  and  computation  efficiency.  Indeed,  such  an  invariance  property  means 
it  is  enough  to  sample  only  from  a  distribution  over  positive  diagonal  matrices.  If  an  algorithm  is 
not  invariant  to  scaling,  then  it  is  not  invariant  to  transformation  by  an  orthonormal  matrix. 

Let  us  first  discuss  vectors.  Consider  the  uniform  distribution  of  points  on  a  unit  sphere;  for  the 
sphere  embedded  in  n  dimensions,  call  this  distribution  S[n].  We  generate  a  unit  vector  v  ~  S[n]  as 
follows  [47]: 

1.  Generate  v  such  that  each  element  is  an  iid  normal  variable. 

2.  Set  v  <—  u/||u||2. 

The  distribution  S[n]  obeys  the  conditions  of  Lemmas  9  and  14.  Observe  that  if  Q  is  orthonormal 
and  v  ~  S[n],  then  Qv  ~  S[n]  as  well. 

We  sample  three  types  of  matrices: 

1.  Positive  diagonal  matrices.  We  specify  the  distribution  in  the  context  of  the  experiment. 

2.  Symmetric  positive  definite  matrices.  We  generate  A  as  follows.  Let  A  be  a  positive  diagonal 
matrix  obtained  by  method  1. 

(a)  Generate  a  square  matrix  A  having  normally  distributed  iid  elements. 

(b)  Compute  the  QR  factorization  A  =  QR. 

(c)  Set  A  «-  QAQt. 

3.  Scaled  spd  matrices. 

(a)  Given  the  positive  diagonal  matrix  A,  generate  the  spd  matrix  A  by  method  2. 

(b)  Given  the  positive  diagonal  matrix  D,  set  A  <—  DAD. 

Expectation  of  a  product 

Because  Cov(x ,  y)  =  E  (x  —  E  x)  (y  —  E  y)  =  E  xy  —  E  xE  y , 


E  xy  =  E  xE  y  +  Cov(x,  y) . 


(3.6) 
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Jensen’s  inequality 

Consider  the  random  variable  x  and  a  convex  function  <f>.  Jensen’s  inequality  [15]  is 


(j>(Ex)  <  E  cj>(x). 


Consequently,  (Ex)  1  <  Ex  1. 

E 4>(x)  need  not  have  an  upper  bound:  consider  </>(x)  =  x-1  for  x  allowed  to  approach  0.  Still, 
(Ex)-1  can  be  a  good  approximation  to  Ex-1  if  x  is  bounded  away  from  0.  For  example,  suppose 
x  has  the  uniform  distribution  between  a  and  6,  and  a,  b  >  0:  x  ~  17 [a,  b\.  Then 


(Ex)-1 


Ex-1 


2 

a  +  b 


dx 


ln(6/a) 
b  —  a 


If  6/a  ~  1,  then  ln(6/a)  ~  (b  —  a) /a  and  so  Ex  1  ~  a  similarly,  2/(a  +  6)  ss  a  and  so  the  two 
expectations  are  approximately  equal. 


The  estimate  ( nuT Au )  1  ~  (traced)  1 

Lemmas  13  and  14  describe  a  matrix-free  estimate  of  the  trace  of  a  matrix.  We  examine  the 
hypothesis  that  if  u  ~  S[n]  and  A  is  an  n  x  n  spd  matrix, 

The  estimate  {nuT Au)-1  ~  (traced)-1  is  accurate.  (3-7) 

Our  only  analytical  characterization  of  the  relationship  between  the  two  expressions  is  the  fol¬ 
lowing: 

(trace  A)-1  =  (nEuT Au)^1  (by  Lemma  14) 

<E(rutTAu)-1  (by  Jensen’s  inequality).  (3.8) 

Rather  than  further  attempt  to  find  a  useful  analytical  characterization,  we  describe  two  numerical 
experiments  that  give  empirical  evidence  supporting  (3.7). 

The  first  experiment  demonstrates  that  the  inequality  (3.8)  is  almost  certainly  strict: 

1.  Choose  the  condition  number  c  of  the  matrix  A,  the  size  of  the  matrix  n,  and  the  number  of 
trials  nt. 

2.  Generate  a  vector  A  such  that  rriirq  A(j)  =  1  and  max,  A (j)  =  c  as  follows. 
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(a)  Generate  a  vector  u  such  that  each  of  n  —  2  elements  is  iid  and  u(j)  ~  f/[0, 1],  one  element 
is  0,  and  another  is  1. 

(b)  Set  u  <—  c(i)u. 

3.  Because  we  shall  compute  the  estimate  nvTAv  using  a  vector  v  ~  5[n],  and  so  QTv  ~  5[n] 
for  an  orthonormal  matrix  Q,  we  may  assume  the  matrix  A  is  simply  the  diagonal  matrix  A. 
Record  trace  A  =  ||  A||  i . 

4.  For  i  from  1  to  rit'. 

(a)  Generate  a  unit  vector  v  ~  5[n]. 

(b)  Record  the  following: 

i.  The  estimate  nvT Av  =  A(j)u(j)2. 

ii.  The  sample  mean  of  this  estimate,  f(i),  using  samples  1  to  i. 

iii.  The  estimate  (nvT Av)^1 . 

iv.  The  sample  mean  of  this  estimate,  r(«),  using  samples  1  to  i. 

(c)  Plot  t  and  r  against  the  true  values  trace  A  and  (trace  A)-1,  respectively. 

Figure  3.1(a)  shows  results  for  a  condition  number  of  1012  and  matrix  size  n.  Observe  that  the  sample 
mean  for  the  estimate  of  trace  A  converges  to  trace  A,  while  the  sample  mean  for  the  estimate  of 
(trace  A)-1  converges  to  a  value  greater  than  (trace  A)-1. 

The  second  experiment  explores  the  robustness  of  the  estimate  {nuT Au)~x  «  (trace  A)-1: 

1.  Choose  a  vector  c,  each  of  whose  elements  gives  a  condition  number  for  which  it  is  desired  to 
obtain  samples.  Choose  a  matrix  size  n.  Choose  the  number  of  trials  nt. 

2.  For  each  c(i): 

(a)  Generate  a  vector  A  as  in  the  first  experiment. 

(b)  Generate  a  unit  vector  v  ~  S[n]. 

(c)  As  in  the  first  experiment,  we  may  assume  the  matrix  A  is  simply  the  diagonal  matrix  A. 
For  each  of  nt  trials,  compute  and  record  the  following: 

i.  trace  A  =  ||  A ||  i . 

ii.  nvT Av  =  n^2j  Mj)vU)2- 

•  mi  i  x*  (trace  A)^1  —  (nvT  Av)~ 1 

m.  Hie  relative  error  r  =  A - ji - — L — . 

(trace  A)  1 

(d)  Plot  the  middle  p  percentile  of  relative  errors. 
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Figure  3.1(b)  shows  results  for  condition  numbers  between  102  and  1016,  number  of  trials  n*  =  104, 
and  matrix  size  n  =  103.  Consistent  with  the  inequality  (3.8),  the  stochastic  estimate  tends  to  be 
larger  than  (traced)-1.  But  the  estimate  is  certainly  meaningful:  half  the  sampled  estimates  lie 
well  within  ±20%  of  (traced)-1  (green  line),  and  90%  lie  within  ±40%. 

We  conclude  that  the  hypothesis  (3.7)  holds  and  so  we  may  use  the  relationship  between  the  two 
expressions  in  our  subsequent  analysis. 


3.2.3  The  update 

Stochastic  framework 

We  have  already  seen  that  we  have  the  two  basic  quantities 

s  =  ap  =  —aD~1g  = 
y  =  As  =  —aAD~1g  =  —oiAD~1D1l2g. 

Let  u  =  ff/||g||,  and  let  us  replace  each  instance  of  ag  with  u: 

s  =  —D~1D1^2u 
y  =  -AD~1D1/2u. 

We  can  remove  the  scalar  a||<?||  as  later  it  will  cancel  out  of  a  quotient. 

By  Lemmas  9  and  14,  these  latter  s  and  y  vectors  yield  (recall  X  =  D~1) 


E  uy2  =  n~1BD~2De  =  n~1BDx2  (3.9) 

Eu(Ds)2  =  Eud2s2  =n~1d  (3.10) 

E  usT  Ds  =  n^trac  e  D1^2  XD1^2  =  n~1xTd  (3-11) 

E  uyTs  =  n_1trace  D1/2XAXD1/2  =  n_1trace  ADX2  (3.12) 

Eu2/Ty  =  retrace  D1^2XA2XD1^2  =  n~ 1  trace  A2DX2  (3.13) 

EusTs  =  n^trace  D1^2  X2  D1^2  =  n~1dTx2,  (3-14) 


where  the  subscripted  u  on  the  expectation  operator  indicates  the  replacement  by  u. 

In  this  section,  we  derive  a  deterministic  iteration  based  on  the  rhs  of  each  of  (3.9)-(3.14). 
First,  observe  that  Eud2s2  =  vTld  immediately  provides  the  scaling  matrix.  Even  better,  in  the 
case  of  s,  we  need  not  make  the  simplifying  assumption  B  =  D.  Then  Eu(Bs)2  =  n~1d.  But  y  has 
quite  valuable  data  and  so  should  not  be  ignored. 

Second,  suppose  v  =  y/||y||,  rather  than  u  =  <?/||ff||,  is  taken  to  be  sampled  from  a  uniform 
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(trace  A)  1 


Relative  error  (%):  (trace  A)  1  vs.  (uT  A  u)”1 


Figure  3.1:  Results  for  two  numerical  experiments  studying  the  relationship  between  (traced)-1 
and  the  estimate  (nuT Au)-1 .  (a)  Experiment  1.  The  x  axis  is  the  number  of  trials  used  to  compute 
the  sample  mean.  The  y  axes  correspond  to  the  values  traced,  (traced)-1,  and  their  estimates. 
The  dashed  lines  are  the  true  values,  (b)  Experiment  2.  The  x  axis  is  the  condition  number  of  A; 
the  y  axis  is  the  relative  error  of  the  estimate.  Each  line  color  corresponds  to  the  indicated  middle 
percentile  of  samples. 


58 


CHAPTER  3.  A  LIMITED-MEMORY  QUASI-NEWTON  METHOD 


distribution  on  the  sphere;  and  let  us  replace  each  instance  of  ay  by  v.  Then 
y  =  -D1/2v 

s  =  -A-'b^v  =  -b-b*A-\ 

where  we  have  assumed  A  has  full  rank;  and,  analogous  to  the  quantities  obtained  from  the  Eu 
operator, 


Evy2  =  n  1d 

E  vs2  =  n~1xi'Y^{A-l)2iy 
o 

The  expression  for  s 2  is  not  useful.  A  simple  stochastic  iteration  based  on  the  expression  for  y 2  is 
d+  =  (1  —  ui)d  +  toy2 , 


although  we  shall  use  a  more  complicated  one  in  a  numerical  test.  Again,  however,  we  seek  an 
iteration  based  on  both  s  and  y. 


Consider  the  iteration 

,+  _  ,  d2s2  y2 

d  d  sTDs  +  yTs' 

An  equivalent  expression  for  the  diagonal  update  is 


j.  I  „  DsstD 
d  =d,‘g(  D~^Di  + 


yy 

yTs 


(3.15) 


(3.16) 


The  diagonal  update  is  obtained  simply  by  extracting  the  diagonal  of  the  BFGS  update  to  a  diagonal 
matrix.  As  a  BFGS  update  preserves  positive  definiteness  and  the  main  diagonal  of  an  spd  matrix 
is  positive,  d+  >  0  if  d  >  0. 


Consider  any  two  random  quantities  y  and  z  inside  the  Eu  operators  among  (3.9)-(3.14).  From 
the  discussion  in  Section  3.2.2,  Eyz  ^  EyEz — the  two  are  separated  by  Co v{y,z)  according  to 
(3.6) — and  E y"1  ^  (Ey)-1,  Jensen’s  inequality  providing  a  one-sided  bound.  Furthermore,  we 
suggested  in  that  same  section  that  Ez_1  «  (Ez)_1  if  z  is  bounded  away  from  0.  In  particular,  we 
discussed  the  accuracy  of  the  approximation  when  z  =  uT  Au  for  u  ~  <S[n];  see  the  discussion  of  the 
hypothesis  (3.7).  Here  we  neglect  Cov(y,  z);  and  as  our  denominators  will  turn  out  to  be  of  the  form 
z  =  uTAu,  we  make  the  approximation 

E-  «  — 
z  ~  E z' 


(3.17) 
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This  approximation  justifies  our  removing  a||<?||  when  we  defined  the  Eu  operator. 
According  to  the  approximation  (3.17)  and  (3.9)-(3.14), 

d2s 2  ^  d 
sTDs  xTd 
y 2  ^  BDx 2 

yTs  trace  AD X2' 

Therefore,  the  deterministic  iteration  corresponding  to  (3.15)  is 


d+  =  d  —  lo 


BDx 2 


trace  ADX2  ^ 


=  G(d), 


(3.18) 


where  the  scalar  u>  is  introduced  for  generality.  Notice  that  scaled  quantities  do  not  appear  in  (3.15) 
but  do  appear  in  (3.18).  The  decomposition  A  =  D1!2 AD1^2  of  course  is  not  known  to  us  and  is  for 
analysis  only:  it  reveals  the  stochastic  binormalizing  process  underlying  (3.15). 


Analysis  of  the  deterministic  iteration 

The  analysis  in  this  section  is  very  similar  to  that  surrounding  Theorem  9. 

Theorem  12.  d  =  £d,  where  £  =  n/traceA ,  is  the  unique  fixed  point  of  the  iteration  (3.18). 
Proof.  Substituting  d  =  into  (3.18)  yields 


,4.  j.  r  uj£d  ujBx 

d+  =  £d - —  H - ^ 

n  trace  AX 


=  £d- 
=  & 


umd 


uBx 


n  trace  A  trace  A 


where  we  have  used  the  equations  Bx  =  d  and  trace  AX  =  trace  A.  Hence  t;d  is  a  fixed  point. 
d  is  a  fixed  point  if 

d  BDx 2 

xTd  trace  ADX2 


As  A  is,  by  assumption,  positive  definite,  so  is  B  by  the  Schur  Product  Theorem;  and  so  BD  is 
nonsingular.  Therefore, 


cTd 


-x2  =  ( BD )“1d. 


trace  ADX2 

The  solution  x2  has  the  form  a{BD)~1d  for  a  scalar  a.  The  lhs  increases  monotonically  with  a,  and 
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so  the  solution  is  unique. 


□ 


Lemma  15.  The  iteration  (3.18)  is  invariant  to  symmetric  permutations. 


Proof.  Let  P  be  a  permutation  matrix.  Then 
xTd=  (Px)T(Pd) 

trace  ADX2  =  trac  e(PAPT)(PDPT)(PXPT)2. 

Hence  if  d  and  d+  satisfy  (3.18),  then 

(  Pd  (PBPT)(PDPT)  (Px)2 


Pd+  =  Pd - 


LU 


(Px)T(Pd)  trace  (PAPT)(PDPT)(PXPT)2  ^ 
and  so  Pd  and  Pd+  satisfy  (3.18)  for  the  matrix  PAPT. 


=  G(Pd ), 


□ 


Theorem  13.  d*  =  fd  is  a  point  of  attraction  of  the  iteration  (3.18)  if  0  <  to  <  n. 


Proof.  First  we  assume  A  is  irreducible. 
Let 


a(d)  =  ( dTx )  1 

/3(d)  =  (trace  ADX2)-1. 


Substituting  d*  yields  the  relations 
a(d*)  =  n-1£ 

(3(d*)  =  £2  (trace  H)-1  =  n-1£3, 


and  using  d^x  =  — X 2  and  dx trace  CX  =  diag(C)  for  a  matrix  C  yields  the  derivatives 


ad(d) 


/ Ud ) 


dTX2 

(i dTx )2 

2  diag (ADX3) 
(trace  ADX2)2 


ad(d*)  =  n  2x 


0d(d*) 


diag  (AX2) 


2n~2f3  diag(HX2). 


(trace  A)2 
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Then 


G(d)  =  d-oo  (a(d)d-  /3{d)BDx2 
Gd(d)  =  I  —  to  (dad(d)  —  BDx2/3d{d)  +  2(3(d)BDX^  =  J{d) 

J(d*)  =  I  —  lo  (n~2dxT  —  2 n~2(Bx  diag(AX2)T  +  2 n_1M2J  . 

Under  a  similarity  transform, 

XJ(d*)D  =  I -to  (n~2eeT  -  2n~2£XBi  diag  (AX)T  +  2  rT1XBX}  . 

As  Bx  =  d,  XBX  =  Xd  =  e;  and  diag(AX)  =  diag (D~1^2AD~1/2)  =  diag(A).  We  obtain 
XJ(d*)D  =  I  —  lo  [n~2eeT  —  2 n~2^e  diag(A)T  +  2n~1  XBX^j 


=  I  —  loti  1  (  n  ie 


e  —  2£  diag(A) 


2XBX^j  . 


Let  us  determine  the  interval  containing  the  eigenvalues  of  J(d*). 

1.  The  first  term  in  parentheses  has  rank  one  and  the  single  nonzero  eigenvalue 

n~1eT  e  —  2£  diag(A)  =  1  —  2 n-1£  trace  A  =  1  —  2  =  —1. 

2.  As  A  is  positive  definite  and  Xi  >  0,  so  is  B  and  XBX;  and  as  Bx  =  d ,  XBX  is  doubly 
stochastic.  Therefore,  XBX  has  eigenvalues  in  the  interval  (0,1],  and  as  we  are  temporarily 
assuming  that  A  is  irreducible,  the  eigenvalue  1  has  multiplicity  one. 

3.  As  e  is  an  eigenvector  of  each  of  the  two  terms  in  parentheses,  we  conclude  from  Lemma  11 
that 


n  1e 


e  —  2£  diag(A) 


2  XBX 


has  eigenvalues  in  the  interval  (0,2). 

4.  Finally,  if  0  <  lo  <  n,  XJ{d*)D ,  and  so  J(d*)  itself,  has  eigenvalues  in  the  interval 
1  -  wn_1(0,2)  C  (-1,1). 


Hence  p(Gd{d*))  <  1. 

Now  suppose  A  is  reducible.  By  Lemma  15,  we  can  assume  A  is  block  diagonal  and  each  block  is 
irreducible.  Our  analysis  to  this  point  applies  to  each  block  separately.  As  the  Jacobian  associated 
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with  each  block  has  spectral  radius  less  than  1,  so  does  the  Jacobian  as  a  whole. 

Hence  by  Theorem  10.1.3  of  [55],  the  iteration  d+  =  G(d )  converges  locally.  □ 


The  third  term 


The  denominator  of  the  third  term  of  (3.15)  need  not  be  yT s — for  example,  Theorems  12  and  13 
hold  with  only  minor  changes  to  their  statements  and  proofs  if  the  denominator  is  either  sTs  or 
yTy — but  numerical  experiments  show  that  yT s  is  best.  This  may  be  explained  as  follows.  Let  the 
eigenvectors  of  A  in  the  model  QP  (3.5)  be  u,;  and  the  eigenvalues,  Aj.  The  diagonal  of  A  may  be 
written 


diag(A)  =y2^~T~ 

Z - J  qy-L  q 


Suppose  s  is  an  eigenvector  of  A  and  has  associated  eigenvalue  A.  Then  y  =  As  =  As  and  so 
V2  A2s2  s2 

T  \  T  T  ’ 

y 1  S  AS 1  S  S1  S 

which  is  exactly  the  contribution  to  the  diagonal  of  A  that  the  eigenvector  s  makes. 


A  numerical  experiment 

The  analysis  in  this  section  makes  a  number  of  simplifying  assumptions,  and  it  does  not  do  more 
than  attempt  to  explain  the  effectiveness  of  the  diagonal  update  (3.15).  Now  we  describe  a  numerical 
experiment  that  supports  the  hypothesis  that  the  diagonal  update  equilibrates  the  Hessian. 

We  implemented  the  L-BFGS  method  [50]  in  a  solver  for  unconstrained  problems.  The  line  search 
routine  is  a  Matlab  translation  of  the  MiNPACK  routine  cvsrch  [46]  by  Dianne  O’Leary  [53].  We 
multiply  the  QN  matrix  by  the  scalar  a  described  in  Section  3.3.1.  We  tested  the  following  updates. 
The  colors  refer  to  the  corresponding  line  colors  used  in  the  plots  in  Figure  3.2. 

•  (black)  Bo  =  al,  where  cr  =  yTy/sTy  [40].  The  results  of  other  methods  should  be  compared 
against  those  of  this  one. 


•  (blue)  Just  s.  The  diagonal  update  is 


d+ 


n 


n  —  1 


ds2  A 

sTDs  J 


d. 


(3.19) 


When  d  oc  d,  each  element  of  the  parenthesized  vector  is  ( n  —  l)/n;  the  factor  n/(n  —  1)  in  the 
update  compensates  for  this. 


(red)  Just  Bs.  An  update  analogous  to  (3.19)  can  produce  an  indefinite  diagonal.  Instead,  we 
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use  an  update  analogous  to  (2.21): 


q  =  Bs 


d+  —  (1  -  w)  ttttp  + 
M\i 


\q2\\i 


We  use  ui  =  1/10. 

•  (green)  Just  y ,  except  for  a  scalar  formed  by  an  inner  product  with  s.  We  use  the  first  and 
third  terms  of  (3.15).  To  compensate  for  the  loss  of  the  term  involving  s,  we  multiply  d  by 
(n  —  1  )/n: 


d+  =  — d. 


sTy' 


•  (cyan)  The  algorithm  LMCBD,  which  uses  the  update  (3.15).  We  discuss  this  and  the  next 
algorithm  in  Section  3.3. 

•  (magenta)  The  algorithm  LMCBD-y,  which  uses  the  update  (3.15). 

The  test  problem  is  the  QP  /( x)  =  \xT  Ax.  A  is  constructed  as  follows.  First,  A  is  created  by 
generating  a  random  orthonormal  matrix  Q  and  a  random  diagonal  matrix  A  having  iid  diagonal 
elements,  each  the  square  of  a  random  normal  variable,  and  then  setting  A  =  QAQT .  Second,  a 
random  diagonal  scaling  matrix  D1^2  having  condition  number  20  is  generated  according  to  step  2 
of  the  first  experiment  in  Section  3.2.2.  Then  A  =  D1/2 AD1/2 .  The  results  we  show  in  Figure  3.2 
are  generic  over  other  distributions  and  condition  numbers  for  D1^2  and  A,  although  we  chose  an 
instance  from  several  trials  that  produced  the  clearest  plots. 

The  size  of  the  problem  is  n  =  100,  and  we  use  a  circular  buffer  that  can  store  np  =  10  pairs. 

The  four  plots  in  Figure  3.2  show  the  following: 

•  ||g||2-  The  2-norm  of  the  gradient. 

•  /.  The  function  value. 

•  nvar(X(A  o  A)x).  The  normalized  variance  of  the  2-norms  of  the  rows  of  the  Hessian  A  when 
scaled  according  to  B0.  The  constant  black  line  is  associated  with  the  constant  diagonal  al. 
The  blue  curve  shows  that  the  update  based  on  Ds  does  not  reduce  the  variance  of  the  row 
2-norms  as  effectively  as  the  other  updates.  The  other  updates  approximately  equilibrate  A. 

•  cond (R~T AR-1).  The  QN  matrix  can  be  thought  of  as  a  preconditioner,  and  so  this  plot 
shows  the  reduction  in  condition  number  of  A  when  preconditioned  by  two  separate  matrices: 
Bo  (curving  dashed  line)  and  the  full  QN  matrix  B  (solid).  The  three  horizontal  dashed  lines 
are,  from  top  to  bottom,  (a)  the  condition  number  of  A ,  (b)  the  condition  number  of  A  with 
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II  g  ll2 


100  200  300  400  500 


cond(R“T  A  FT1) 


Figure  3.2:  Numerical  experiment  for  various  diagonal  updates. 


Jacobi  scaling,  and  (c)  the  condition  number  of  A  with  exact  binormalization.  The  discussion 
regarding  the  relationship  between  binormalization  and  Jacobi  scaling  for  spd  matrices  in 
Section  2.2.3  explains  why  (b)  and  (c)  are  quite  close  to  each  other.  The  solid  curves — 
corresponding  to  preconditioning  by  the  full  QN  matrices-  fall  below  (c),  but  the  dashed 
curves — corresponding  to  preconditioning  by  just  Bq — level  off  at  (b)  and  (c),  as  we  expect. 

This  experimental  procedure  does  not  predict  the  performance  of  diagonal  updates  on  constrained 
problems:  when  implemented  in  SNOPT,  only  the  update  (3.15),  implemented  in  the  algorithm 
LMCBD  and  LMCBD-y,  proved  to  be  robust  over  the  range  of  test  problems  we  consider  in  Section 
3.5. 


3.3  The  limited-memory  quasi-Newton  methods: 

lmcbd  and  lmcbd-7 

In  this  section  we  describe  two  closely  related  algorithms  that  use  the  diagonal  update  (3.15).  We 
call  these  algorithms  LMCBD  and  LMCBD-y:  LM  is  for  limited  memory;  CB,  circular  buffer;  and  D, 
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diagonal  update. 

First  we  describe  a  scalar  that  calibrates  the  unit  step  length. 


3.3.1  A  scalar 


After  certain  steps,  the  QN  matrix  H  should  be  multiplied  by  a  scalar  to  calibrate  the  unit  step 
length.  SNOPT  performs  this  calibration  after  the  first  QN  update  following  the  reset  of  the  QN 
matrix  to  a  diagonal.  We  use  the  same  calibration  formula  as  SNOPT;  but  as  we  update  the  diagonal 
each  time  the  full  QN  matrix  is  updated,  we  find  it  best  to  calibrate  at  each  update. 

When  updating  B to  Bk+ 1 ,  the  basic  calibration  formula  is 


7  k 


Vksk  . 
s^BkSk  ’ 


(3.20) 


but  we  also  safeguard  the  value: 


7 k  =  min(100,  max(10  4,y *,)). 


(3.21) 


We  maintain  a  scalar  /3k  that  is  updated  by  pk+i  =  7 kPk-  After  updating  the  vector  dk  to  dk+ 1 
according  to  the  diagonal  update  (3.15),  we  set  B0  =  fik+idk+i- 

The  calibration  formula  (3.20)  has  at  least  one  natural  interpretation.  At  iterate  Xk,  we  have  a 
quadratic  approximation  to  the  Lagrangian: 

F(p )  =  \ PTBkp  +  glv  +  F0. 

The  solution  is  the  step  pk  =  —B~jll  gk-  Next,  the  line  search  determines  the  step  size  a^.  The  new 
iterate  is  Xk+ 1  =  Xk  +  otkPk ,  and  the  new  gradient  is  gk+i-  As  Sk  =  Xk+ 1  —  Xk  =  ctkPk , 


Rk^k  (%kgk-  (3.22) 

Given  this  information,  we  would  like  to  determine  a  scalar  7 k  such  that  if  Bk  =  'YkBk,  then  the 
unit  step  is  likely  acceptable.  One  obvious  choice  is  jk  =  07  ,  for  then  BkSk  =  ct^1BkSk  =  —gk 
and  the  corresponding  unit  step  has  already  been  accepted.  But  this  choice  ignores  the  additional 
information  gk+i- 

Let 

/(«)  =  PkFiaPk )  =  +  ba  +  c 
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be  a  scalar  quadratic  function  in  the  step  size  a.  Then 
f'(a)  =  aa  +  b. 


As 


/'( 0)  =  b  =  plgk 
f{ak)  =aak  +  b  =  pkgk+i, 


we  obtain 

=  Plidk+i  -  9k)  =  pjvk 
ak  ak 

b  =  plgk ■ 


Minimizing  f(a)  gives  a  step  size  that  can  be  expected  to  be  better  than  a /'(a)  =  0  if 


a 


b  cnkpkgk  _  akskgk  _  slBksTk 


vlvk 


slVk 


slVk 


where  the  last  equality  follows  from  (3.22).  Finally,  we  set  jk  =  d  . 

An  obvious  question  is  whether  one  could  do  better  based  on  information  internal  to  the  line 
search:  fit  a  higher-order  polynomial  than  a  quadratic,  for  example.  In  fact,  the  unit  step  should  be 
taken  frequently  as  the  algorithm  converges,  and  so  the  quadratic  fit  uses  all  the  information  that 
is  typically  available  without  expending  any  extra  user  function  calls. 

A  less  quantitative  interpretation  of  the  calibration  formula  is  revealing.  s^yk  is  the  change  in 
the  gradient  of  the  Lagrangian  along  sk.  skBksk  is  the  expected  change  in  the  gradient  based  on 
the  quadratic  approximation  F(p).  Hence  7*,  >  1  when  the  actual  change  is  larger  than  expected, 
and  7^  <  1  when  it  is  smaller. 

The  classical  L-BFGS  method  of  Nocedal  [50]  uses  the  scalar  ak  =  yk  H0yk/y^ sk,  where  typically 
Hq  =  I.  The  interpretation  of  this  scalar  is  that  it  broadly  accounts  for  the  magnitude  of  the 
curvature  of  the  problem.  In  Section  3.5,  we  compare  "fk  with  this  classical  one  on  unconstrained 
problems  and  find  7 k  gives  slightly  superior  performance. 

The  update  (3.15)  is  the  same  as  one  of  those  considered  in  [22,  64,  65],  but  our  scalar  is  quite 
different.  Their  scalar  is 


vk  = 


vlDklVk 

T  ‘ 

skVk 


which  is  the  same  as  ak  for  Hq  =  Dk  1 .  Using  vk  on  constrained  problems  is  disastrous;  and  as  we 
demonstrate  in  Section  3.5,  uk  is  not  as  good  as  jk  on  unconstrained  problems. 
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Our  view  is  that  7*,  is  superior  to  both  vk  and  ak  because  it  is  motivated  by  the  most  natural 
interpretation:  not  as  a  means  to  account  for  the  magnitude  of  curvature,  but  rather  as  a  means  to 
calibrate  the  unit  step  length  given  sufficient  information  to  form  a  quadratic  model  of  the  merit 
function  projected  along  the  current  step  direction. 

3.3.2  LMCBD 

Now  we  can  assemble  our  primary  algorithm,  LMCBD.  If  s^yk  is  sufficiently  positive,  the  update  is 
as  follows: 

1.  Update  the  diagonal  dk  to  dk+i  according  to  (3.15). 

2.  If  cond(diag(dfc+i))  exceeds  a  tolerance,  reinitialize  the  quasi-Newton  approximation,  set 
dk- 1-1  =  e,  reset  /3k+ 1,  and  go  to  step  5. 

3.  Compute  jk  according  to  (3.20)  and  (3.21). 

4.  Set  pk+i  =  7 kPk- 

5.  Set  B0  =  /3fc+:Ldiag(c4+i). 

6.  Update  the  pair  buffer  with  the  new  pair  {sk,yk}. 

In  step  2,  other  choices  are  possible;  for  example,  we  could  set  Bq  =  crl,  where  a  =  ykyk/ s^yk, 
and  retain  the  pair  buffer.  However,  we  have  found  that  on  the  rare  occasions  that  the  condition 
number  of  diag(dfc+i)  exceeds  our  tolerance,  it  is  likely  that  the  current  pairs  are  not  helpful  for 
finding  a  good  descent  direction.  /30  could  be  set  to  1;  but  as  it  is  available  to  us  in  SNOPT,  we  use 
as  a  better  initial  estimate  the  scalar  U0pre**2,  a  description  of  which  is  beyond  the  scope  of  this 
chapter. 

3.3.3  LMCBD-7 

When  the  iterates  are  converging,  d  may  remain  approximately  the  same.  But  how  does  d  behave 
during  the  initial  iterations?  In  fact,  making  a  general  statement  is  not  possible:  the  behavior  of 
d  in  the  early  iterations  depends  very  much  on  the  problem.  We  have  observed,  particularly  when 
the  degrees  of  freedom  in  a  problem  is  large  and  so  a  limited-memory  method  will  typically  require 
a  large  number  of  iterations,  that  d  may  change  quickly  during  early  iterations  and  so  be  unhelpful. 
In  contrast,  on  problems  having  few  degrees  of  freedom,  the  total  number  of  major  iterations  can 
be  quite  small,  and  so  a  quickly  changing  d  can  be  useful. 

In  any  case,  the  problem  is  not  severe;  and  we  certainly  do  not  recommend  using  the  modified  form 
of  LMCBD  we  describe  in  this  section  for  anything  other  than  unconstrained  problems.  Our  primary 
motivation  for  this  section  and  the  subsequent  numerical  experiments  involving  both  LMCBD  and 
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LMCBD-7  is  that  Veerse  and  Auroux  [65]  considered  essentially  both  algorithms — using  iy,  rather  than 
7 k — among  the  many  other  diagonal  updates  they  tried,  and  could  not  make  a  firm  recommendation 
in  favor  of  one  over  the  other.  Our  objective,  then,  is  to  understand  how  the  two  algorithms  differ. 

Suppose  7a,  is  much  larger  than  1.  Then  evidently  the  step  size  ak  was  quite  small,  and  the 
current  model  of  the  second-order  information  greatly  overpredicts  the  extent  to  which  the  iterates 
should  differ.  In  that  case,  it  may  be  useful  to  prevent  d  from  changing  by  much.  Next,  suppose  7*, 
is  much  smaller  than  1.  Then  the  step  size  a &  was  quite  large,  and  so  the  iterates  greatly  differ.  In 
this  case,  it  may  be  useful  to  allow  d  to  change  greatly. 

In  LMCBD,  (3k  =  lij=o  7fc-  I11  LMCBD-7,  we  set  /3k  =  1  and  accumulate  7 directly  in  the  diagonal 
dk- 


d+ 


yd  + 


The  inclusion  of  7  in  the  diagonal  update  moderates  the  influence  of  the  term  y2 /yT s.  If  7  is  large, 
the  term  has  a  relatively  smaller  effect;  if  7  is  small,  larger. 


3.4  Implementation 

The  software  implementations  of  the  diagonal  update  (3.15)  and  the  scalar  /?  are  straightforward, 
as  each  involves  only  element-wise  or  inner  products  among  the  vectors  dk,  Sk,  and  yk- 

The  critical  part  of  the  implementation  is  the  limited-memory  quasi-Newton  method  itself.  It 
must  handle  a  continually  updated  diagonal  initial  matrix  B0  and  a  circular  buffer. 

3.4.1  Implementation  in  SNOPT  7  and  8 

Three  computational  requirements  guide  our  implementation.  First,  SNOPT  solves  the  quadratic 
program  (QP)  subproblem  using  a  null-space  method.  This  QP  solver  accesses  the  quasi-Newton 
matrix  through  matrix-vector  products.  Second,  the  implementation  should  allow  for  a  fast  update. 
Third,  storage  should  be  minimal:  the  optimal  amount  of  storage  is  (2 np  +  l)n:  storage  for  np  pairs 
and  one  vector  for  the  diagonal. 

The  product-form  update  maintains  numerical  positive  definiteness,  and  it  may  be  that  this 
concern  should  have  priority.  However,  we  have  implemented  a  summation-form  update  and  have 
not  experienced  numerical  problems.  For  a  circular  buffer,  the  summation  form  requires  less  storage. 

The  product-form  stores  the  pairs  {s.,-,  u,}  (see  Section  3.1.3).  Vj  is  a  linear  combination  of  y.j 
and  qj.  When  a  pair  is  removed  from  the  circular  buffer,  a  new  Bj_  1  corresponds  to  the  old  Sj ,  and 
so  qj  and  Vj  are  incorrect.  Therefore,  yj  must  be  stored  whether  or  not  Vj  is.  But  storage  for  either 
Vj  or  qj  must  remain,  as  the  product-form  matrix-vector  product  depends  on  one  of  these.  Hence 
the  product  form  requires  storage  for  3 np  +  1  n-vectors. 
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Byrd,  Nocedal,  and  Schnabel’s  [13]  compact  representation  of  a  limited-memory  BFGS  method 
is  useful.  Let 


Du  —  Sj  yi 

Lij  =  sf  yj  for  i  >  j. 


Their  representation  is  as  follows: 


JJT  =  StBqS  +  LD~1Lt  =  C 


(3.23) 


Theorem  2.4  of  [13]  shows  that  if  Bg  is  p.d.  and  sfyi  >  0,  then  C  is  p.d.,  and  so  the  Cholesky 
factorization  C  =  JJT  exists. 

Let  k  be  the  number  of  pairs  in  the  buffer;  here  we  reserve  np  for  the  number  of  pairs  the 
buffer  can  hold.  At  a  particular  major  iteration,  the  diagonal  is  updated,  and  so  ST BgS  must  be 
computed  anew  in  0(k2ri)  operations.  Then  the  Cholesky  factorization  C  =  JJT  is  computed  in 
0(k3)  operations. 

A  matrix-vector  product  u  <—  Bu  requires  0(kn )  operations: 

(  YT  \ 

1.  Set  v  <—  it. 

\stbq) 

2.  Solve  Mw  =  v. 


3.  Set  u  <—  B0u  —  (y  B0S^j 


3.4.2  SNOPT  in  the  future:  Block-LU  QP  solver 

A  future  version  of  SNOPT  will  use  a  new  QP  solver  that  uses  the  block-LU  method.  The  block- 
LU  method  is  particularly  suited  to  large  problems  in  which  the  number  of  active  constraints  is 
substantially  fewer  than  the  number  of  variables;  in  this  situation,  the  reduced  Hessian  in  the  null- 
space  approach  becomes  excessively  large.  Our  limited-memory  method  is  well  suited  to  problems 
on  which  the  block-LU  method  will  excel,  for  in  such  problems,  second-order  information  assumes  a 
larger  role  relative  to  constraints  in  determining  the  search  direction. 
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Consider  the  quadratic  program 


subject  to  Ax  >  0. 


At  a  particular  iteration,  let  A0  be  the  matrix  corresponding  to  the  current  active  set.  The  KKT 
matrix  is 


K0  = 


f) 


LqUo-i 


where  Kq  =  LqUq  is  a  factorization  of  K.  The  block-LU  method  handles  updates  to  the  active  set 
by  appending  blocks  V  and  D  to  K0  and  computing  a  block  factorization  of  the  resulting  KKT 
system: 


fRo  v)  =  fL0  \  fu0  Y\ 

\vT  d)  { z -  i){  cj- 


(3.24) 


See  Hanh  Huyhn’s  thesis  [29]  for  more  about  the  method. 


Suppose  Bk  =  blgs({sj ,  yj}j=1,  B0).  First  we  consider  the  product-form  BFGS  matrix.  Let  Vj 
and  qj  be  as  they  are  defined  in  Section  3.1.3,  and  recall  sjqj  =  < pj1 .  The  solutions  y\  and  2/2  are 
the  same  in  the  following  two  systems  [29]: 


ft  ■' j  (I)  ■  (I) 


/  B() 

AT  M/\ 

1 y 1s 

(dl) 

A 

V2 

= 

d2 

viuT 

D) 

\r) 

1°; 

where  W  =  yqi  v\ 
'b0  Wt' 


Qk  Vk )  i  and  D  is  block  diagonal  and  has  j th  block 


1 


if 


Kn  = 


w 


,  then  the  second  system  has  the  form  (3.24),  and  so  the  block-LU  method  can 


efficiently  incorporate  a  limited-memory  BFGS  matrix  in  product  form.  If  instead  the  BFGS  matrix 
is  expressed  in  summation  form — =  B0  +  UUT  —  ZZT — then  we  can  set  [25]  W  =  (jj  zj  and 

D  =  |  | .  U  and  Z  can  be  expressed  in  at  least  two  ways.  The  summation  form  (3.3)  shows 


-I, 
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that 


uj  ~  1/2 Vo  and  Zj  =  (sj Qj)  1/2 Qj- 

Equivalently,  in  terms  of  the  compact  representation’s  matrix  F  in  (3.23), 

(ll  Z)  FT  =  (y  B0S )  •  (3.25) 

Let  us  consider  storage  and  computation  requirements  in  each  case: 

1.  Store  {sj,  Vj}.  The  updated  diagonal  and  circular  buffer  require  storing  the  y:j  vectors  as  well. 
At  each  major  iteration,  each  Vj  vector  is  recomputed  for  a  total  of  0(k2n)  operations.  If  the 
block-LU  QP  solver  uses  the  product  form,  the  {sj,  Vj}  pairs  can  be  used  without  modification. 
Hence  the  minimum  storage  is  (3np  +  l)n  vectors. 

2.  Store  {qjjVj}-  Again,  a  third  set  of  vectors  must  be  stored,  in  this  case  Sj.  At  each  major 
iteration,  each  qj  vector  is  recomputed  for  a  total  of  0{k2n)  operations.  If  the  block-LU  QP 
solver  uses  the  summation  form,  the  {qj,Vj}  pairs  can  be  used  without  modification.  Hence 
the  minimum  storage  is  again  (3np  +  l)n  vectors. 

3.  Store  {sj,yj}.  The  SQP  method  uses  the  compact  representation.  At  each  major  iteration, 
StBqS  must  be  recomputed  in  0{k2n)  operations.  The  block-LU  QP  solver  can  use  either 
the  product  or  the  summation  forms;  in  either  case,  new  pairs  {'Uj.  z:]}  must  be  computed  in 
0(k2n)  operations  using  (3.25).  Hence  the  minimum  storage  is  (4np  +  1  )n  vectors. 

In  all  three  cases,  the  diagonal  Bq  can  be  used  without  modification  in  the  QP  solver.  This  analysis 
shows  that  if  memory  is  shared  between  the  outer  SQP  and  inner  QP  solvers,  either  of  cases  1  and 
2  is  optimal. 

However,  this  conclusion  may  be  premature.  A  block-LU  QP  solver  accesses  the  Hessian  approx¬ 
imation  directly  rather  than  through  matrix-vector  products  and  so  can  advantageously  use  external 
factorization  libraries  [29].  It  is  quite  likely  that  the  QP  solver  must  store  its  own  pairs  {uj,  Zj}  and 
diagonal  Bo,  as  these  make  up  some  of  the  elements  of  the  block-LU  matrix.  Hence  the  QP  solver 
requires  (2 np  +  1  )n  vectors  of  storage  independent  of  the  outer  SQP  solver.  Meanwhile,  the  SQP 
solver  requires  the  storage  indicated  in  each  of  the  three  cases:  either  (3 np  +  l)n  (cases  1  and  2)  or 
(2np  +  l)n  (case  3).  In  all  three  cases,  0(k2n)  operations  must  be  performed  at  each  major  iteration. 
Hence  the  compact  representation  may  have  the  same  advantage  for  an  SQP  method  based  on  a 
block-LU  QP  solver  as  for  one  based  on  a  null-space  QP  solver. 
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3.4.3  Software 

We  implemented  our  method  in  SNOPT.  The  user  has  the  choice  of  setting  the  parameter  Hessian 
either  to  Limited,  which  is  the  old  limited-memory  method,  or  to  LMCBD,  which  is  our  new  method. 
The  default  limited-memory  method  is  Limited.  Except  for  increasing  the  number  of  possible  values 
for  the  parameter  Hessian,  our  method  does  not  change  SNOPT’s  user  interface. 

We  recommend  using  the  new  Hessian  option  on  problems  for  which  evaluating  the  user  function 
and  gradients  is  time  consuming  relative  to  SNOPT’s  computations.  In  particular,  for  problems 
having  only  linear  constraints  and  for  which  the  objective  is  evaluated  quickly,  the  original  Hessian 
=  Limited  option  is  probably  better. 

For  our  numerical  experiments  on  unconstrained  problems,  we  also  implemented  our  method  and 
several  others  in  the  software  lbfgs.f  [50]. 


3.5  Numerical  experiments 

Our  objective  in  our  numerical  experiments  is  to  investigate  the  effectiveness  of  our  algorithms  as 

well  as  the  robustness  of  the  SNOPT  modification. 

Framework 

We  use  several  test  sets: 

•  COPS  3.0  [20].  The  COPS  3.0  test  set  is  a  set  of  high-quality  constrained  problems  imple¬ 
mented  in  AMPL.  We  discard  the  quadratic  programs,  leaving  20  problems.  Each  problem 
but  one  has  three  sizes,  and  one  has  four. 

•  CUTEr/Gill  and  Leonard  [24],  A  set  of  51  unconstrained  problems.  We  include  an  additional 
problem,  dixmaanj ,  because  it  is  a  member  of  a  set  of  problems  Gill  and  Leonard  include. 

•  CUTEr/Kaustuv  [31].  41  of  the  42  constrained  problems  in  Kaustuv’s  thesis.  We  discard 
degen2  as  we  did  not  find  it  in  our  CUTEr  archive. 

•  CUTEr/Byrd,  Lu,  Nocedal,  and  Zhu  [12].  33  of  the  34  bound-constrained  problems  from 
their  test  set.  One  problem,  hs25,  was  removed  because  for  an  unresolved  reason,  the  CUTEr 
framework  considered  the  problem  solved  after  0  major  iterations.  The  number  of  function 
calls  indicated  in  the  tables  of  results  in  [12]  suggest  those  authors  had  the  same  issue. 

•  GPOPS  [58].  13  optimal  control  problems  distributed  with  the  GPOPS  system. 

•  The  34  general  problems  from  SNOPT’s  examples  directory. 

•  CUTEr/lbfgs .  f .  152  unconstrained  problems. 
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Ibfgs.f/CUTEr:  Function  calls  (N=125) 


(a) 


(b) 


Ibfgs.f/CUTEr:  Function  calls  (N=134) 


Problem  number  n 


Ibfgs.f/CUTEr/Leonard:  Function  calls  (N=46) 


(e)  (f) 

Figure  3.3:  Results  for  the  original  and  modified  versions  of  lbfgs.f. 


SNOPT  is  run  on  all  but  the  final  test  set;  the  original  and  modified  forms  of  lbfgs.f  are  run  on 
the  latter. 
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(a) 


COPS3.0/10  pairs:  Function  calls  (N=52) 


COPS3.0/10  pairs:  Function  calls  (N=55) 


(b) 

COPS  3.0/10  pairs:  Time  (N=55) 


Problem  number  n 


(c) 


COPS3.0, 10  pairs,  reduced  Hessian  50:  Function  calls  (N=53) 


Problem  number  n 


(d) 


(e) 


Figure  3.4:  Results  for  SNOPT  on  the  COPS  3.0  test  set. 


When  we  plot  the  results  of  multiple  algorithms  together,  we  have  to  discard  problems  for 
which  different  solutions  are  obtained.  We  also  discard  any  problem  on  which  every  algorithm  fails. 
Consequently,  the  number  of  problems  in  the  figures  may  differ  from  the  numbers  indicated  in  this 
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Problem  size  n 


(a) 


GPOPS:  Function  calls  (N=12) 


(b) 


(c) 


GPOPS:  Function  calls  (N=12) 


Figure  3.5:  Results  for  SNOPT  on  the  GPOPS  example  problems. 


list. 
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(a) 


CUTEr/Kaustuv/10  pairs:  Function  calls  (N=37) 


CUTEr/Kaustuv/10  pairs:  Function  calls  (N=38) 


(b)  (c) 

Figure  3.6:  Results  for  SNOPT  on  the  CUTEr/Kaustuv  test  set. 


As  in  Chapter  2,  we  use  performance  profiles.  Our  primary  performance  metric  is  the  number 
of  user  function  evaluations;  for  two  test  sets,  we  also  assess  CPU  time.  We  use  a  second  plot  that 
shows  performance  in  a  different  way.  Associated  with  each  algorithm  is  a  sorted  list  of  metric 
values,  one  element  for  each  problem.  The  sorted  list  is  cumulatively  summed  and  the  result  is 
plotted.  The  curve  associated  with  a  better-performing  algorithm  increases  more  slowly  than  that 
associated  with  a  worse-performing  algorithm. 

In  plots  showing  problem  sizes,  the  size  refers  to  the  number  of  variables  in  the  problem. 

We  use  the  following  specification  file  for  SNOPT  on  the  CUTEr  problems: 

Begin  SNOPT-Cuter  NLP  problem 


Major  Print  level 

000001 

Minor  print  level 

0 

Major  iteration  limit 

10000 

Iterations  limit 

100000 

Solution 

no 

Hessian  Limited 


3.5.  NUMERICAL  EXPERIMENTS 


77 


(a) 


CUTEr/Gill  and  Leonard/10  pairs:  Function  calls  (N=42) 


Problem  number  n 


CUTEr/Gill  and  Leonard/10  pairs:  Function  calls  (N=46) 


Problem  number  n 


(b) 


(c) 


Figure  3.7:  Results  for  SNOPT  on  the  CUTEr/Gill  and  Leonard  test  set. 


END  SNOPT-Cuter  NLP  problem 

We  use  the  default  options  for  SNOPT  when  running  on  the  COPS  3.0  problems.  For  the 


lbfgs.f  code 

in  the  CUT' 

5 

M 

10 

IPRINT(l) 

0 

IPRINTC2) 

100000 

MAXIT 

l 

diagopt 

0.00001 

EPS 

Results  for  modifications  to  lbfgs.f 

Figure  3.3  shows  results  for  lbfgs.f  on  the  test  set  of  CUTEr  unconstrained  problems.  We  tested 
the  following  methods: 


•  L-BFGS.  The  original  method. 
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(a) 


CUTEr/Byrd,  Lu,  Nocedal,  Zhu/10  pairs:  Function  calls  (N=26) 


CUTEr/Byrd,  Lu,  Nocedal,  Zhu/10  pairs:  Function  calls  (N=29) 


14000 
12000 
c  10000 
~  8000 
|  6000 
4000 
2000 

5  10  15  20  25 

Problem  number  n 


(b) 


(c) 


Figure  3.8:  Results  for  SNOPT  on  the  CUTEr/Byrd,  Lu,  Nocedal,  and  Zhu  test  set. 


•  L-BFGS-y.  The  original  method,  but  au  is  replaced  with  7 

•  LMCBD.  Our  primary  algorithm. 

•  LMCBD-y.  Our  secondary  algorithm  in  which  7/-  is  used  to  moderate  the  diagonal  update. 

•  lmcbd-za  Our  secondary  algorithm,  but  y*,  is  replaced  with  Vk- 

Every  modification  performs  better  on  average  than  the  original  method.  Figure  3.3(c)  shows 
that  LMCBD-y  outperforms  LMCBD-^:  even  in  the  case  of  unconstrained  problems,  the  scalar  7  is 
evidently  better  than  v.  Similarly,  3.3(d)  shows  that  L-BFGS-y  slightly  outperforms  L-BFGS:  7  is 
also  better  than  <7. 

Although  LMCBD  outperforms  L-BFGS,  LMCBD-y  is  the  best  of  the  algorithms  tested. 

Figure  3.3(f)  shows  results  for  just  the  subset  of  problems  that  are  in  the  CUTEr/Leonaixl  set. 
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SNOPT  examples:  Function  calls  (N=22) 


2X 


Figure  3.9:  Results  for  SNOPT  on  the  SNOPT  example  problems. 

Results  for  SNOPT 

We  tested  the  modified  SNOPT  on  six  test  sets  having  among  them  four  interfaces:  AMPL,  CUTEr, 
Matlab,  and  Fortran.  We  tested  three  modifications  to  SNOPT: 

•  Limited.  SNOPT’s  current  Hessian  approximation. 

•  LMCBD.  Our  primary  algorithm. 

•  LMCBD-'y.  Our  secondary  algorithm  in  which  7*.  is  used  to  moderate  the  diagonal  update. 

•  L-BFGS-7.  Our  diagonal  update  is  replaced  by  7*./.  Note  that  7,  rather  than  er,  is  used. 

We  feel  our  most  important  results  are  those  shown  in  Figure  3.4  for  the  COPS  3.0  test  set.  The 
test  set  has  generally  constrained  problems  of  high  quality,  and  so  we  believe  that  the  results  on  this 
set  are  particularly  meaningful.  Figure  3.4(b)  shows  results  for  all  the  methods.  First,  both  forms 
of  the  LMCBD  algorithm  outperform  the  other  methods.  Second,  L-BFGS  outperforms  Limited  in 
general,  but  it  does  quite  a  bit  worse  on  a  few  problems.  We  conclude  that  both  a  circular  buffer  and 
the  diagonal  update  contribute  to  LMCBD’s  success.  Figures  3.4(c,d)  show  results  just  for  LMCBD 
and  the  original  method.  The  second  of  the  two  uses  CPU  time  in  seconds  as  the  performance 
metric.  Figure  3.4(e)  shows  that  LMCBD  remains  effective  even  when  SNOPT  is  forced  to  use  the 
conjugate  gradient  algorithm,  rather  than  the  Cholesky  factorization,  to  solve  the  reduced  Hessian 
system  once  the  reduced  Hessian’s  size  exceeds  50.  We  use  the  default  setting  CG  Iterations  = 
100.  Until  seeing  the  results  of  this  test,  one  might  be  concerned  that  the  richer  information  in 
LMCBD’s  Hessian  approximation  could  limit  the  conjugate  gradient  algorithm’s  ability  to  reduce  the 
residual  of  the  reduced  Hessian  system  in  the  limited  number  of  iterations  it  is  permitted. 

Figure  3.5  shows  results  for  the  GPOPS  example  problems.  Figures  3.5(b,c)  show  the  results 
including  and  excluding,  respectively,  the  results  for  L-BFGS,  as  including  this  algorithm  reduces  the 
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relatively  small  test  set  size  by  2  because  of  different  local  minimizers.  Figure  3.5(e)  uses  CPU  time 
as  the  performance  metric.  The  results  are  qualitatively  similar  to  those  for  the  COPS  3.0  test  set. 

Figure  3.6  shows  results  for  the  CUTEr/Kaustuv  set  of  constrained  problems.  LMCBD-y  fares 
relatively  poorly,  but  LMCBD  continues  to  be  successful. 

Figure  3.7  shows  results  for  the  CUTEr/Gill  and  Leonard  test  set  of  unconstrained  problems. 
Figure  3.7(b)  shows  results  similar  to  Figure  3.3(b,f):  LMCBD-y  is  superior  to  LMCBD  on  uncon¬ 
strained  problems.  We  discuss  our  interpretation  of  these  results  in  the  next  section. 

Figure  3.8  shows  results  for  the  CUTEr/Byrd,  Lu,  Nocedal,  and  Zhu  test  set  of  bound-constrained 
problems.  Results  are  similar  to  those  on  the  other  test  sets  of  constrained  problems. 

Finally,  as  a  further  check  of  the  robustness  of  the  software,  we  test  SNOPT  using  the  old  and 
new  Hessian  approximations  on  the  problems  in  SNOPT’s  examples  directory.  The  results  are 
shown  in  Figure  3.9. 

3.6  Summary  and  conclusions 

This  chapter  describes  a  limited-memory  quasi-Newton  method,  LMCBD,  that  combines  a  diagonal 
update  with  a  limited-memory  quasi-Newton  method  based  on  a  circular  buffer. 

Theorem  11  and  subsequent  analysis  suggests  that  the  proper  interpretation  of  a  diagonal  initial 
matrix  Bq  is  that  it  scales  the  problem.  Therefore,  an  effective  Bq  is  one  that  reduces  the  condition 
number  of  the  Hessian  by  scaling.  Motivated  by  this  observation,  we  developed  a  diagonal  update 
that  theoretical  analysis — albeit  quite  approximate — and  numerical  experiments  show  equilibrates 
the  Hessian  matrix.  We  combined  the  diagonal  update  with  a  limited-memory  quasi-Newton  method 
based  on  a  circular  buffer  to  create  the  method  we  call  LMCBD. 

We  also  discussed  the  importance  of  scaling  the  Hessian  approximation  by  a  scalar.  This  topic  has 
received  attention  for  several  decades.  <r  is  often  used;  and  v  was  introduced  along  with  a  diagonal 
update  like  ours  by  previous  authors,  a  and  v  are  interpreted  as  approximating  the  magnitude  of 
the  curvature  in  the  part  of  the  space  for  which  curvature  information  is  missing.  In  contrast,  we 
interpret  the  scalar  multiple  as  calibrating  the  unit  step  based  on  the  most  recent  information.  This 
observation  motivates  our  adopting  7  as  LMCBD’s  scalar  multiple. 

We  implemented  our  algorithms  in  lbf  gs  .  f  for  experimental  purposes  and  in  SNOPT  for  eventual 
release.  We  conducted  a  broad  range  of  numerical  experiments  and  made  several  observations: 

•  7  is  an  effective  scalar  multiple  and  appears  to  be  better  than  a  and  v . 

•  LMCBD  is  an  effective  algorithm,  although  LMCBD-7  performs  better  on  the  test  set  of  uncon¬ 
strained  problems.  We  believe  the  latter  holds  because  7  moderates  the  change  in  the  diagonal. 
The  unconstrained  problems  in  our  test  set  typically  require  many  more  iterations  than  the 
other  test  problems,  and  so  moderate  changes  in  the  diagonal  are  beneficial.  In  contrast,  when 
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relatively  few  major  iterations  are  performed  on  constrained  problems,  a  rapidly  changing 
diagonal  is  useful. 

•  Using  a  circular  buffer  in  SNOPT,  regardless  of  the  form  of  Bq ,  improves  performance  on  a 
number  of  problems,  but  on  some  problems,  performance  is  greatly  reduced. 

•  Combining  a  circular  buffer  with  a  diagonal  update,  as  is  done  in  LMCBD,  produces  a  robust 
Hessian  approximation  that  on  average  outperforms  SNOPT’s  current  method. 


Chapter  4 

Interpolating  quasi-Newton  data 
from  a  coarse  to  a  fine  mesh 


A  differential  equation  is  often  solved  by  discretizing  it  on  a  sequence  of  increasingly  fine  meshes. 
Similarly,  an  optimization  problem  involving  functionals  and  differential  equations  can  be  solved  by 
discretizing  the  functionals  to  produce  a  sequence  of  numerical  optimization  problems  defined  on 
increasingly  refined  meshes.  In  this  chapter,  we  develop  a  method  to  interpolate  our  limited-memory 
quasi-Newton  matrix  from  a  coarse  to  a  fine  mesh  for  problems  of  this  type. 

There  is  a  large  literature  on  differential-equation-constrained  optimization  problems  (DECP). 
Roughly  two  types  of  problems  are  studied,  generally  by  different  researchers:  optimal  control  prob¬ 
lems ,  such  as  problems  concerning  the  trajectory  of  a  rocket,  that  usually  involve  the  ordinary 
differential  equations  of  mechanics  [10,  3,  58];  and  partial  differential  equation-constrained  ( PDE - 
constrained)  problems  [4].  Additionally,  many  researchers  have  studied  methods  that  solve  problems 
on  multiple  meshes  of  varying  fineness:  for  example,  geometric  and  algebraic  multigrid  methods  [6]. 

Researchers  have  used  quasi-Newton  methods  to  solve  DECPs.  Almost  30  years  ago,  Griewank 
and  Toint  [27]  proposed  a  specialized  quasi-Newton  method  for  objectives  that  are  a  sum  of  convex 
element  functions.  The  QN  matrix  is  partitioned  according  to  the  objective.  They  remarked,  though 
did  not  describe  in  detail,  that  one  can  propagate  the  element  Hessians  from  one  mesh  to  another. 
Kelley  and  Sachs  [32]  developed  rather  technical  conditions  on  the  initial  matrix  Bq  so  that  a  quasi- 
Newton  method  takes  a  number  of  iterations  independent  of  the  mesh  fineness. 

Interpolating  a  limited-memory  quasi-Newton  matrix  has  not  received  attention  for  a  simple 
reason:  in,  say,  L-BFGS,  the  data  are  not  sufficiently  valuable  to  justify  the  complexity  of  interpo¬ 
lating  them  from  a  coarse  grid  to  a  fine  one.  In  contrast,  the  diagonal  matrix  Bq  in  LMCBD  is  quite 
valuable,  and  so  interpolation  may  be  justified.  This  chapter  develops  such  a  method. 
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4.1  Interpolation  rules 

The  primary  tool  of  our  method  is  a  set  of  three  interpolation  rules:  one  each  for  s,  y,  and  d,  the  quasi- 
Newton  pair  and  diagonal  of  the  diagonal  matrix  B0.  We  establish  these  rules  in  the  simpler  setting 
of  approximating  integrals,  then  consider  differential-equation-constrained  optimization  problems. 


4.1.1  Integrals 


Consider  the  functional 

H[u}=  /  h(x,  u,  dxu,  dxu, . . .)  dx,  (4.1) 

Jn 

where  dxu  denotes  the  *th  partial  derivative  of  u.  The  first  variation  of  H[u ]  is 

SH[u]  =  f  Sh  dx, 

J  n 

where  S  denotes  the  variation.  Sh  may  be  quite  complicated,  but  if  h  and  u  are  sufficiently  smooth, 
as  we  assume  here,  and  the  variation  5u  is  smooth  by  construction,  then  so  is  Sh.  (For  more  on  the 
variational  calculus,  see,  for  example,  [39].) 

In  an  optimization  problem,  boundary  conditions  must  be  provided.  For  simplicity  in  the  present 
exposition,  we  assume  the  boundary  values  of  u  are  fixed  and  so  the  variation  Su  is  0  on  the  boundary 
80.  Consequently, 


/  Sh  dx  =  /  Hu  Su  dx 
Jn  Jn 

for  a  differential  operator  H  and  a  smooth  variation  Su:  the  boundary  terms  that  arise  from  inte¬ 
grating  by  parts  vanish. 

We  consider  discretizations  of  the  integral  in  which  the  approximation  has  one  of  two  forms: 
either  it  is  centered  at  the  nodes,  or  it  is  centered  at  cell  centers,  u  is  approximated  by  u  and  the 
variation  Su  by  Su.  We  assume  both  barred  values  are  defined  at  the  nodes.  The  approximation  has 
the  form 


H  u  Su  dx 


y  [  dtSu^f.ii . 
ie  N 


(4.2) 


where  (f>i  is  the  approximation  to  Hu  at  node  location  x,,  J\f  is  the  set  of  nodes,  and  is  the  weight 
associated  with  node  i. 

From  the  other  direction,  H[u\  may  be  approximated  by  a  node-centered  or  a  cell-centered 
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expression.  For  generality,  we  write 

H[u\  «  y  hu{u)Hi  +  '^2h2i{u)vi  =  H{u)\ 
ieAT  i£C 

C  is  the  set  of  cell  indices,  and  is  the  weight  associated  with  cell  i.  For  future  use,  Xi  is  the  location 
of  the  ith  cell  center.  The  differential  of  H{u )  is 

d uH{u)  =  y  daiH(u)dui. 


If  diti  =  5ui ,  then  SH[u]  ~  d fliL(u)  and 
«  y  (w)d'u,;. 

ieN  i£j\f 

Consequently,  we  make  the  identification  ~  fa  Hi-  Additionally,  by  the  approximation  (4.2), 

daiH(u)/^i  «  Hu\x=Xi.  (4.3) 


Suppose  we  have  two  meshes.  The  mesh  to  which  a  quantity  belongs  is  denoted  by  a  superscript 
1  or  2.  Let  T  interpolate  either  a  node-centered  or  a  cell-centered  quantity.  For  example,  we  can 
interpolate  the  solution  u 1  on  the  first  mesh  to  the  second  mesh: 

tfi-Iu1.  (4.4) 

Similarly,  we  can  interpolate  cp2  <—  IcjA .  We  refer  to  (4.4)  as  the  first  interpolation  rule.  Both 
node-centered  and  cell-centered  quantities  may  appear  in  a  problem;  we  use  the  same  symbol  T  for 
both  cases,  though  the  underlying  implementation  of  T  depends  on  the  case. 

To  interpolate  <9ai  H(u)  requires  several  extra  steps.  Because  u  does  not  vary  on  dll  —and,  more 
generally,  boundary  conditions  are  such  that  u  at  the  boundary  nodes  must  be  treated  separately — 
the  interpolation  operator  T  must  be  modified  to  exclude  the  boundary.  Generally,  extrapolation 
is  necessary  for  points  near  the  boundary.  Let  the  resulting  operator  be  T.  Given  two  vectors  a 
and  6,  let  a/b  be  element-wise  division  and  ab  be  element-wise  multiplication.  Let  the  operator  TZ 
restrict  the  vector  a  to  interior  nodes,  and  let  the  operator  £  expand  the  vector  IZa  to  all  nodes. 
The  latter  operator  combines  the  data  in  IZa  with  boundary  condition  data.  T  uses  only  the  interior 
nodes;  hence  we  adopt  the  convention  that  T  performs  its  own  restriction:  for  example,  T(lZa)  is 
redundant  and  is  instead  written  simply  as  Ta.  £  and  7 Z  are  the  identity  operators  when  applied  to 
cell-centered  quantities. 
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As 

SH[u]  «  Y  ~  Y 

i£  JV1  ieN2 

and  recalling  8^11  (u)  ~  and  (4.3), 

du*H2{u2)  «  £(n(g.2)i(dulH1(u1)/fi1)). 

Consequently,  the  interpolation  rule  for  the  gradient  g  =  daH(u)  is 
9 2  -  : 

in  words,  divide  the  gradient  by  the  node  weights  on  mesh  1  to  obtain  the  discretization  of  a  smooth 
quantity,  interpolate  the  result,  then  multiply  by  the  node  weights  on  mesh  2.  For  future  use,  we 
need  to  write  this  rule  slightly  more  generally.  Let  Xi  =  th  if  9  is  node-centered;  and  Xi  =  vi-  if 
cell-centered.  Then 

g2  <-  eyitfW /x1)). 

This  is  the  second  interpolation  rule. 

For  convenience,  we  denote  the  first  rule  by  ls\  T(v})  =  Xs(u1);  and  the  second  by  ly :  ly{g1)  = 
£{n{g2)±{g'/g})). 

As  a  check  on  our  notation,  let  us  evaluate  an  integral  on  two  meshes.  Suppose  u2  and  H2(u2) 
are  interpolated  from  ul  and  <9si  H1  (u1) ,  respectively,  according  to  the  two  interpolation  rules;  and 
each  quantity  is  node-centered.  Let  u  take  prescribed  values  on  the  boundary.  We  have 

/  u  Hu  dx  «  u2(j)2pL2  (approximation  to  the  integral  on  mesh  2) 

ie  N2 

«  ( u2)TdU2H2(u 2)  {d^H2{u2)  «  <^V) 

=  ls(u1)Tly(dai H1  (u1))  (substitution  of  the  two  interpolation  rules)  (4.5) 

=  l(u1)T£(IZ(g2)i(duiH1(u1)/g1))  (definition  of  18  and  Ty ) 

« I(u1)T£(R(92m1))  (9fii H\ux)  «  ^V1) 

«  Y^  (approximation  on  mesh  1). 

i&V1 


In  the  final  line,  we  assume  each  mesh  is  sufficiently  fine  so  that  the  approximations  to  the  integral 
on  the  two  meshes  yield  approximately  the  same  values. 
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An  example  in  one  dimension 

Consider  the  functional 

If1 

H[u )  =  -  J  ^u2  +u2xdx 

subject  to  Dirichlet  boundary  conditions  on  u.  The  first  variation  5H  for  a  variation  Su  for  which 
the  end  points  are  0  is 


If1 

5H[u\  =  -  /  Su2  +  Su2  da; 
2  7-i 


2 uSu  +  2 uxSux  dx 


i- 1 


=  J  uSu  dx  +  J  ux5ux  dx 
=  J  uSu  dx  —  J  uxx5u  dx  +  ux8u\1_l 
=  J  (u  —  uxx)5u  dx  =  J  HuSu  dx, 


where  Hu  =  u  —  uxx . 

Now  we  develop  approximations  to  the  integral.  Discretize  [— 1, 1]  by  N  cells  with  node  indices 
TV  =  {0, 1, ,  N};  xq  =  —1,  Xjv  =  1,  and  xi+i  >  Xj.  Let  the  cell  indices  be  C  =  N\  {0}.  The  term 
u 2  of  the  integrand  is  approximated  by  a  node-centered  sum;  the  term  u2,  by  a  cell-centered  sum: 


da;  «  ^  u2/j,i  =  Hi(u) 
ie  N 


dx^Y, 


iec 


'U'i  Ui— i 
Xi—i  Xi 


Vi  =  H2{u ), 


where 


1 


xi  -  x0 


if  i  =  0 


Mi  =  2  \  Xi+l  ~  Xi~1  if  0  <  *  <  N 
xn  —  xn- i  if  i  =  N 

—  Xi  Xi—  i 


and  H  =  Hx  +  //2. 

We  test  our  results  so  far  by  plotting  several  quantities  for  a  particular  function  u  discretized  on 
two  meshes  in  Figure  4.1.  Mesh  1  is  a  combination  of  20  linearly  spaced  points  and  20  randomly 
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a  h/u, 

u  ^ 


Figure  4.1:  Illustration  of  the  interpolation  rules. 

placed  points.  Mesh  2  augments  mesh  1  by  adding  40  more  randomly  placed  points.  We  use 
cubic  spline  interpolation  to  define  T.  In  the  top  panel,  all  other  quantities  are  compared  with 
Hu  =  u  —  uxx  (black):  dUi  Hl  (u1)  /  for  two  meshes  i  =  1,2  (blue  and  red,  respectively)  and 
Ty(duiH1(u1))/p2  (green).  In  the  bottom  panel,  the  latter  three  quantities  are  plotted  with  their 
denominators  removed:  these  curves  are  the  gradients  for  the  discrete  problem.  The  green  and  red 
curves  should  be  compared:  the  first  is  the  interpolation  of  the  gradient  from  mesh  1  to  2,  while  the 
second  is  the  exact  gradient  on  mesh  2. 

4.1.2  Differential  equation  constraints 

Consider  the  DECP 

inf  f{p) 

U 

subject  to  Dk[u\p]  =  0  (4.6) 

boundary  conditions, 

where  p  is  a  set  of  parameters,  /  is  a  function  of  p,  u  is  a  function  defined  over  the  domain  LI,  and 
D  is  an  operator  involving  a  differential  operator,  k  indexes  a  particular  differential  equation  in  the 
system  of  equations;  separating  the  differential  equations  may  be  necessary  for  the  discretization 
step.  The  objective  could  be  a  functional,  such  as  H[u\.  But  one  can  generally  reformulate  the 
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problem:  introduce  a  differential  equation  for  the  integrand  in  H  and  append  it  to  D;  introduce  an 
equation  that  involves  a  new  parameter  q,  appended  to  p ,  and  u  on  the  boundary;  finally,  set  the 
objective  to  a  function  of  q. 

Discretizing  (4.6)  yields  the  numerical  optimization  problem 
min  f(p) 

U 

subject  to  Dk(u,p)  =  0  (4.7) 

boundary  conditions. 

Corresponding  to  (4.6)  and  (4.7),  respectively,  are  the  Lagrangians 

C  =  f(p)  —  Y"'  /  A kDk[u]  dx  +  [boundary  condition  terms]  (4.8) 

k 

A kDk[u)  +  [boundary  condition  terms]. 

k  i 

The  index  set  over  which  the  inner  sum  occurs  in  C  depends  on  the  discretization  of  the  differential 
equations. 

C  ~  C  and,  similarly,  corresponding  terms  in  the  two  Lagrangians  are  approximately  equal. 
From  these  relationships,  we  can  obtain  the  interpolation  rules,  u  and  D[u]  are  functions  and  so  are 
interpolated  by  the  first  rule: 

u2  <—  Tsiu1) 

D2{u2)  <—  ls{D1(u1,p)). 

Each  of  the  integrals  in  (4.8)  has  the  form  of  the  integral  (4.1).  Hence  the  interpolation  rule  for 
gradients  applies.  Let  O'  be  discretizations  of  the  continuous  Lagrangian  on  meshes  i  =  1,2.  Then 
we  interpolate  by  the  second  interpolation  rule: 

4.1.3  Diagonal  B0 

The  primary  application  of  our  interpolation  rules  is  to  interpolate  the  quasi-Newton  pairs  {s,  y}  and 
the  diagonal  matrix  Bq  =  diag(d)  from  one  mesh  to  another.  In  the  context  of  the  model  problem 
(4.7),  s  =  u+  —  u  and  y  =  —  Ca,  and  so  we  set 
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Figure  4.2:  Solutions  to  the  two  minimal  surface  problems  on  intermediate  meshes. 


These  interpolations  motivate  the  shorthand  Is  and  Iy. 

We  can  determine  an  interpolation  rule  for  Bq  =  diag(d)  as  follows,  d  is  updated  by 


d+ 


d- 


0 ds )2 
sT(ds) 


(4.9) 


Consider  the  final  term,  y2/sTy.  From  (4.5),  we  know  that  sTy  is  an  approximation  to  an  integral 
and  so  varies  over  different  meshes  only  by  the  truncation  error,  y  is  interpolated  according  to  the 
rule  for  gradients.  Unfortunately,  y  itself  is  lost  if  one  has  access  only  to  d.  We  adopt  a  third 
interpolation  rule  for  this  case: 


d2  <-  £ / n1))2  =  Tv{Vd})2. 


First  the  square  root  of  ( y 1)2  is  taken;  then  the  result  is  interpolated  according  to  the  second 
interpolation  rule;  finally,  the  result  is  squared. 


4.2  Numerical  experiments 


We  test  our  interpolation  rules  on  two  types  of  problems:  the  two-dimensional  minimal  surface 
problem  and  optimal  control  problems. 
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Solver  parameters 


>H 

m 


m 

(i, 

A 

B  W  H 

E-i  >  H 

Z;  ■<  Z 

K  m  S 


#  nodes  /  #  ufn 


265  373  567  2191  4057 


Time  (s) 


/ 


54 

74 

96 

177 

242 

778.9 

45 

57 

69 

139 

181 

583.9 

45 

55 

77 

130 

174 

559.0 

54 

59 

83 

133 

187 

594.1 

45 

43 

58 

102 

124 

412.3 

45 

45 

60 

84 

112 

368.1 

45 

47 

57 

88 

110 

367.1 

45 

46 

53 

88 

110 

368.5 

45 

42 

51 

81 

106 

352.1 

/ 


/ 

/  / 
/  /  / 


Table  4.1:  Results  for  the  problem  MS-1. 


Solver  parameters 


z 

o 

m 

Ph 

Pi 

a 

H 

z 


#  nodes  /  #  ufn 


341  198  814  3232  12715 


Time  (s) 


/ 


82 

64 

159 

304 

539 

8374.5 

60 

44 

105 

190 

382 

5877.8 

58 

46 

121 

216 

407 

6290.4 

82 

48 

141 

204 

319 

4995.8 

60 

37 

94 

134 

241 

3795.6 

58 

37 

95 

144 

231 

3647.8 

60 

39 

95 

118 

198 

3084.1 

60 

38 

90 

124 

184 

2882.1 

60 

36 

83 

106 

169 

2654.1 

/ 

/  / 


4 

5 

6  /  / 

7  /  / 

8  /  / 
9  /  / 


/ 


/ 


Table  4.2:  Results  for  the  problem  MS-2. 


4.2.1  The  minimal  surface  problem 

The  minimal  surface  problem  on  a  two-dimensional  domain  minimizes  the  area  of  a  surface  subject 
to  Dirichlet  boundary  conditions  on  the  boundary: 


inf  /  ^1  +  |Vw|2  dfl 
“  Jsi 

We  solve  the  problem  using  a  finite-element  method  and  first-order  triangular  elements.  We  use 
the  Matlab  software  distmesh  [56]  to  discretize  the  domain.  The  objective  for  the  resulting 
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unconstrained  numerical  optimization  problem  is 

F(u)  =  \A  +  (Zx{u))i  +  (Zy(u))i  Vi: 

iec 

where  the  index  i  £  C  is  over  the  triangle  (cell)  indices,  u  is  the  vector  of  surface  heights  at  the 
nodes,  Vi  =  Ai  is  the  area  of  the  *th  triangle,  and  ( zx)i  and  (zy)i  are  the  components  of  the  gradient 
of  the  plane  over  that  triangle.  We  can  reorder  the  operations  so  that  the  objective  is  written 

F(u)  =  Y  Ti  (4-10) 

ieN 

for  some  set  of  nonlinear  functions  /, .  Here  the  index  i  is  over  the  node  indices  and  /Zj  is  the  sum 
of  the  areas  of  the  triangles  involving  node  i. 

Specifying  a  problem  requires  creating  a  domain  and  setting  the  Dirichlet  boundary  conditions. 
We  consider  two  problems;  we  call  them  MS-1  and  MS-2.  Their  solutions  on  intermediate  meshes 
are  shown  in  Figure  4.2. 

A  problem  is  solved  on  a  sequence  of  meshes.  The  first  mesh  is  coarse  and  uniform.  Each 
subsequent  mesh  is  finer  than  the  previous,  and  the  area  of  a  triangle  is  adapted  to  the  solution  on 
the  preceding  mesh.  Since  the  problems  are  solved  to  a  high  precision  on  each  mesh,  the  sequence  of 
meshes  is  the  same  for  all  parameter  variations.  Indeed,  to  save  time,  we  generate  the  meshes  once 
and  then  use  them  for  all  other  parameter  variations. 

We  developed  an  unconstrained  solver  in  Matlab  that  implements  the  L-BFGS  [50]  method 
with  our  variations  and  uses  the  line  search  routine  cvsrch.m  [53].  The  solver  parameters  are  all 
boolean  valued.  In  all  cases,  the  circular  buffer  can  store  10  pairs.  Varying  parameters  include  the 
following  (abbreviations  refer  to  our  tables  of  results): 

•  INTERP-SOLN.  Interpolate  the  solution  u  using  cubic  spline  interpolation. 

•  USE-LMCBD.  If  true,  use  LMCBD;  otherwise,  use  the  standard  L-BFGS  algorithm. 

•  INTERP-SY.  Interpolate  the  pairs  {s,j/}.  The  pairs  that  are  interpolated  are  acquired  when 
the  magnitude  of  the  gradient  is  the  square  root  of  the  desired  tolerance.  The  pairs  obtained 
in  the  final  iterations  tend  not  to  be  as  helpful. 

•  INTERP-DIAG.  Interpolate  the  final  diagonal  d  from  the  iteration  on  the  coarse  mesh  to  the 
fine  mesh. 

•  INIT-DIAG.  Initialize  the  diagonal  d  in  Bo  =  diag(d)  to  If  interp-diag  is  true,  then 

ignore  this  option  except  on  the  coarsest  mesh.  This  option  uses  exactly  the  same  information 
and  user  code  as  interpolating  the  QN  matrix. 
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•  SAVE-SY.  If  INTERP-QN  is  true,  retain  the  interpolated  pairs  in  a  static  buffer  and  never  discard 
them.  Consequently,  the  QN  matrix  for  a  mesh  has  up  to  10  more  pairs  in  its  static  buffer 
than  the  QN  matrix  for  the  previous  mesh.  When  computing  a  matrix-vector  product,  the 
order  of  the  pairs  is  the  static  buffer  from  oldest  to  newest,  then,  as  usual,  the  circular  buffer 
from  oldest  to  newest. 

Tables  4.1  and  4.2  show  our  results.  The  left  column  numbers  the  different  tests.  The  check 
marks  indicate  which  solver  options  are  active.  In  the  nodes  /  #  ufn”  block,  each  column  shows 
the  number  of  user  function  calls  required  to  solve  the  problem  on  the  mesh  having  the  indicated 
number  of  nodes.  The  right  column  shows  the  accumulated  time  in  seconds  to  solve  the  problem  on 
all  the  meshes. 

The  results  accord  with  the  hypothesis  that  interpolating  the  solution  and  quasi-Newton  matrix 
from  one  mesh  to  another  is  useful. 

Lines  1  and  5  are  controls.  In  line  1,  each  problem  is  solved  independently  of  the  others  and 
L-BFGS  is  used. 

Lines  2  and  3  show  that  using  LMCBD  improves  performance,  relative  to  line  1,  on  all  meshes; 
and  initializing  the  diagonal  is  useful. 

In  line  4,  the  solution  is  interpolated.  The  results  shown  in  all  subsequent  lines  should  be 
compared  with  this  one. 

Lines  5  and  6  are  to  line  4  as  lines  2  and  3  are  to  1;  and  the  two  sets  of  results  have  similar 
behavior. 

In  line  7,  the  diagonal  matrix  Bq  is  interpolated  in  addition  to  the  solution.  Lines  6  and  7  suggest 
that  initializing  B o  to  or  interpolating  the  previous  mesh’s  Bq  are  about  equally  effective. 

Lines  8  and  9  interpolate  the  {s,y}  pairs  in  addition  to  Bq ;  line  8  simply  initializes  the  new  QN 
matrix  with  them,  while  line  9  saves  them  in  a  static  buffer  that  grows  with  each  succeeding  mesh. 

For  MS-1,  lines  6  to  9  show  about  equal  performance;  for  MS-2,  these  lines  show  performance 
that  continues  to  improve  significantly  as  more  aggressive  methods  are  applied. 

4.2.2  Optimal  control 

Next  we  implement  the  interpolation  rules  to  solve  a  set  of  optimal  control  problems.  We  examine 
four  problems: 

•  brach.  We  use  the  energy  formulation  of  the  brachistochrone  problem.  The  position  of  the 
ball  is  given  by  ( x,y ).  At  the  initial  time,  it  has  zero  velocity,  height,  and  position  in  the  x 
direction;  at  the  final  time,  it  has  height  —1  and  x  position  1.  The  ball’s  kinetic  energy  is 
Ek  =  |( x 2  +  y 2)  and  its  potential  energy  is  Ep  =  gy.  The  energy  of  the  ball  is  conserved,  and 
so  A  Ek  +  AEp  =  0.  The  ODE  relate  position  and  velocity. 

•  GLIDER.  The  glider  problem  from  COPS  3.0  [20]. 
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USE-LMCBD 

interp-diag 

#  nodes  /  #  ufn 

10  19  37  73  145  289  577  1153 

Time  (s) 

1 

99  47  61  75  100  124  135  166 

682.3 

2 

/ 

99  21  24  25  24  34  26  34 

80.0 

3 

/ 

286  31  42  59  68  76  90  73 

344.2 

4 

✓ 

/ 

286  14  16  27  45  31  55  39 

127.9 

USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  20  40  80  160  320 

Time  (s) 

1 

99  57  84  136  180  207 

108.3 

2 

/ 

99  42  39  38  55  80 

40.3 

3 

/ 

286  44  63  83  120  155 

80.5 

4 

/ 

/ 

286  21  24  22  32  40 

23.5 

Table  4.3:  Results  for  the  problem  brach. 


USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  19  37  73  145 

Time  (s) 

USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  20  40  80  160 

Time  (s) 

1 

67  61  41  64  72 

22.2  1 

67  105  249  224  269 

86.8 

2 

/ 

67  34  54  48  56 

16.7  2 

/ 

67  60  82  72  88 

28.5 

3 

/ 

34  62  32  40  73 

19.7  3 

/ 

34  43  207  204  840 

180.5 

4 

/ 

/ 

34  37  29  35  34 

12.4  4 

/ 

/ 

34  175  59  43  78 

25.4 

Table  4.4:  Results  for  the  problem  glider. 


•  ROCKET.  The  rocket  problem  from  COPS  3.0. 

•  VANDERPOL.  A  test  problem  from  SNOPT’s  test  set. 

We  discretize  all  the  problems  except  glider  using  the  trapezoid  rule.  We  believe  because  of  the 
formulation  of  the  control,  the  trapezoid  rule  for  glider  can  produce  a  problem  having  meaningless 
solutions.  As  this  issue  is  unrelated  to  our  work,  we  simply  use  the  implicit  Euler  rule  for  this 
problem.  We  present  results  for  both  uniform  and  randomly  generated  nonuniform  meshes.  The 
latter  is  meant  to  simulate  adaptive  refinement,  for  we  did  not  implement  true  adaptive  refinement 
in  these  tests. 
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USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  19  37  73  145 

Time  (s) 

1 

45  25  28  29  9 

6.5 

2 

/ 

45  11  10  29  21 

7.3 

3 

/ 

27  19  34  73  15 

10.6 

4 

/ 

/ 

27  27  34  25  8 

6.2 

Table  4.5:  Results  for 


USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  19  37  73  145 

Time  (s) 

1 

39  10  9  8  8 

3.3 

2 

/ 

39  20  18  13  17 

5.3 

3 

/ 

16  9  10  9  9 

3.2 

4 

/ 

/ 

16  12  12  14  12 

4.1 

USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  20  40  80  160 

Time  (s) 

1 

45  15  25  37  48 

14.4 

2 

/ 

45  10  11  20  16 

6.6 

3 

/ 

27  33  17  19  34 

10.3 

4 

/ 

/ 

27  22  19  21  19 

7.6 

the  problem  ROCKET. 


USE-LMCBD 

INTERP-DIAG 

#  nodes  /  #  ufn 

10  20  40  80  160 

Time  (s) 

1 

39  18  24  26  32 

9.0 

2 

/ 

39  27  30  28  51 

12.5 

3 

/ 

16  16  22  22  28 

7.8 

4 

/ 

/ 

16  14  22  19  28 

7.6 

Table  4.6:  Results  for  the  problem  VANDERPOL. 


Recall  that  the  interpolation  rules  act  on  only  the  interior  nodes.  We  set  the  value  of  a  boundary 
node  on  the  fine  mesh  to  its  value  on  the  coarse  mesh. 

We  test  four  SNOPT  parameter  variations.  We  use  either  the  original  Hessian  =  Limited  QN 
method  or  the  new  Hessian  =  LMCBD  method;  and  we  either  interpolate  just  the  solution,  or  both  the 
solution  and  Bq.  Enabling  interpolation  of  the  QN  matrix  requires  altering  SNOPT;  unfortunately, 
these  alterations  at  present  are  little  more  than  a  hack  and  so  cannot  be  released. 

Tables  4. 3-4. 6  summarize  the  results.  In  each  figure,  the  first  table  shows  results  for  the  sequence 
of  uniform  meshes;  the  second,  the  nonuniform  meshes.  Our  objective  in  performing  these  tests  is  to 
validate  the  interpolation  rules  on  generally  constrained  optimization  problems;  exact  performance 
details  are  not  of  particular  interest. 

On  the  problems  BRACH  and  glider,  LMCBD  generally  outperforms  SNOPT’s  original  QN 
method,  and  interpolating  Bq  is  beneficial.  On  the  problem  ROCKET,  interpolating  the  diagonal 
improves  performance,  but  LMCBD  and  SNOPT’s  original  method  perform  about  equally  well.  On 
the  problem  VANDERPOL,  all  variations  perform  about  equally  well. 
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4.3  Summary  and  conclusions 

We  derived  three  interpolation  rules — one  each  for  s,  y,  and  Bq  =  diag(d) — for  our  limited-memory 
quasi-Newton  method  based  on  identifying  correspondences  in  the  continuous  and  discretized  opti¬ 
mization  problems.  Then  we  tested  the  rules  on  two  2D  finite-element  problems  and  four  optimal 
control  problems.  We  found  that  the  interpolation  rules  improve  performance  (ms-1,  ms-2,  brach, 
glider,  rocket) — often  markedly  (ms-2,  brach,  glider) — or  at  least  do  not  reduce  it  (vander- 
pol). 

Unlike  the  methods  reported  in  Chapters  2  and  3,  these  interpolation  rules  cannot  be  imple¬ 
mented  as  a  black  box;  they  require  problem-dependent  information.  One  application  might  be  to 
implement  these  rules  within  optimal  control  or  PDE-constrained  packages  that  implement  partic¬ 
ular  discretization  methods.  Though  a  package’s  developer  would  have  to  implement  the  method, 
the  package’s  users  would  not. 
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